Outlines

Abstract

Education plays a vital role in our life. Knowing what factors can affect student’s performance on test scores would be helpful to educators. We want to explore the “Student Grade Prediction” data set from Kaggle to understand the influence of the parent’s background, test preparation and other factors on student’s performance. There are three questions we are interested in:
1. What are the fundamental factors that will affect students’ performance on their final grade?
2. Which factors influence poor performance on the final grade the most?
3. What would be the best way to improve student scores on their final grade?

Describe attribute types for each attribute in the data set

# The data contains of a number of the following fields:
# school      Student’s school: 'GP' or 'MS' 
# Sex         ‘F’ – Female or ‘M’ – Male
# Age         ’15 – 22’
# Address     home address: ‘U’ – urban or ‘R’ – rural
# Famsize     ‘LE3’ – Less than/equal to 3 or ‘GT3’ – Greater than 3
# Pstatus     Parent’s cohabitation status: ‘T’ – Living together or ‘A’ – Apart
# Medu        education: ‘0’ – None, ‘1’ – Primary, ‘2’ – 5th to 9th grade, ‘3’ – Secondary, or ‘4’ – Higher education
# Fedu        education: ‘0’ – None, ‘1’ – Primary, ‘2’ – 5th to 9th grade, ‘3’ – Secondary, or ‘4’ – Higher education
# Mjob        ‘teacher’, ‘health’ care related, civil ‘services’ (administrative or police), ‘at_home’, or ‘other’
# Fjob        ‘teacher’, ‘health’ care related, civil ‘services’ (administrative or police), ‘at_home’, or ‘other’
# Reason      Reason to choose this school: close to ‘home’, school ‘reputation’, ‘course’ preference, or ‘other’
# Guardian    Student’s guardian: ‘mother’, ‘father’, or ‘other’
# Traveltime  Home to school travel time:  ‘1’ - <15 min, ‘2’ – 15 to 30 min, ‘3’ – 30 min to 1 hr, ‘4’ - >1 hr
# Studytime   Weekly study time:  ‘1’ - <2 hours, ‘2’ – 2 to 5 hours, ‘3’ – 5 to 10 hours, ‘4’ - >10 hours
# Failures    Number of past class failures: ‘n’ if 1 < n <3, else 4
# Schoolsup   Extra educational support: ‘yes’ or ‘no’
# Famsup      Family educational support: ‘yes’ or ‘no’
# Paid        Extra paid classes within the course subject: ‘yes’ or ‘no’
# Activities  Extra-curricular activities: ‘yes’ or ‘no’
# Nursery     Attend nursery school: ‘yes’ or ‘no’
# Higher      Wants to take higher education: ‘yes’ or ‘no’
# Internet    Internet access at home: ‘yes’ or ‘no’
# Romantic    With a romantic relationship: ‘yes’ or ‘no’
# Famrel      Quality of family relationships: (1 to 5) 1 – very bad   5 – excellent
# Freetime    Free time after school: (1 to 5)
# Goout       Going out with friends: (1 to 5)
# Dalc        Workday alcohol consumption: (1 to 5)
# Walc        Weekend alcohol consumption: (1 to 5)
# Health      Current health status: (1 to 5)
# Absences    Number of school absences: 0 to 93
# G1          First period grade: 0 to 20
# G2          Second period grade: 0 to 20
# G3          Final grade: 0 to 20 (output target)

Data and Data Processing

Are there and missing values?

colSums(is.na(math))
##     school        sex        age    address    famsize    Pstatus 
##          0          0          0          0          0          0 
##       Medu       Fedu       Mjob       Fjob     reason   guardian 
##          0          0          0          0          0          0 
## traveltime  studytime   failures  schoolsup     famsup       paid 
##          0          0          0          0          0          0 
## activities    nursery     higher   internet   romantic     famrel 
##          0          0          0          0          0          0 
##   freetime      goout       Dalc       Walc     health   absences 
##          0          0          0          0          0          0 
##         G1         G2         G3 
##          0          0          0

Structure of the dataset

str(math)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 395 obs. of  33 variables:
##  $ school    : chr  "GP" "GP" "GP" "GP" ...
##  $ sex       : chr  "F" "F" "F" "F" ...
##  $ age       : num  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : chr  "U" "U" "U" "U" ...
##  $ famsize   : chr  "GT3" "GT3" "LE3" "GT3" ...
##  $ Pstatus   : chr  "A" "T" "T" "T" ...
##  $ Medu      : num  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : num  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : chr  "at_home" "at_home" "at_home" "health" ...
##  $ Fjob      : chr  "teacher" "other" "other" "services" ...
##  $ reason    : chr  "course" "course" "other" "home" ...
##  $ guardian  : chr  "mother" "father" "mother" "mother" ...
##  $ traveltime: num  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : num  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : num  0 0 3 0 0 0 0 0 0 0 ...
##  $ schoolsup : chr  "yes" "no" "yes" "no" ...
##  $ famsup    : chr  "no" "yes" "no" "yes" ...
##  $ paid      : chr  "no" "no" "yes" "yes" ...
##  $ activities: chr  "no" "no" "no" "yes" ...
##  $ nursery   : chr  "yes" "no" "yes" "yes" ...
##  $ higher    : chr  "yes" "yes" "yes" "yes" ...
##  $ internet  : chr  "no" "yes" "yes" "yes" ...
##  $ romantic  : chr  "no" "no" "no" "yes" ...
##  $ famrel    : num  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : num  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : num  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : num  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : num  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : num  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : num  6 4 10 2 4 10 0 6 0 0 ...
##  $ G1        : num  5 5 7 15 6 15 12 6 16 14 ...
##  $ G2        : num  6 5 8 14 10 15 12 5 18 15 ...
##  $ G3        : num  6 6 10 15 10 15 11 6 19 15 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   school = col_character(),
##   ..   sex = col_character(),
##   ..   age = col_double(),
##   ..   address = col_character(),
##   ..   famsize = col_character(),
##   ..   Pstatus = col_character(),
##   ..   Medu = col_double(),
##   ..   Fedu = col_double(),
##   ..   Mjob = col_character(),
##   ..   Fjob = col_character(),
##   ..   reason = col_character(),
##   ..   guardian = col_character(),
##   ..   traveltime = col_double(),
##   ..   studytime = col_double(),
##   ..   failures = col_double(),
##   ..   schoolsup = col_character(),
##   ..   famsup = col_character(),
##   ..   paid = col_character(),
##   ..   activities = col_character(),
##   ..   nursery = col_character(),
##   ..   higher = col_character(),
##   ..   internet = col_character(),
##   ..   romantic = col_character(),
##   ..   famrel = col_double(),
##   ..   freetime = col_double(),
##   ..   goout = col_double(),
##   ..   Dalc = col_double(),
##   ..   Walc = col_double(),
##   ..   health = col_double(),
##   ..   absences = col_double(),
##   ..   G1 = col_double(),
##   ..   G2 = col_double(),
##   ..   G3 = col_double()
##   .. )

Summary

summary(math)
##     school              sex                 age         address         
##  Length:395         Length:395         Min.   :15.0   Length:395        
##  Class :character   Class :character   1st Qu.:16.0   Class :character  
##  Mode  :character   Mode  :character   Median :17.0   Mode  :character  
##                                        Mean   :16.7                     
##                                        3rd Qu.:18.0                     
##                                        Max.   :22.0                     
##    famsize            Pstatus               Medu            Fedu      
##  Length:395         Length:395         Min.   :0.000   Min.   :0.000  
##  Class :character   Class :character   1st Qu.:2.000   1st Qu.:2.000  
##  Mode  :character   Mode  :character   Median :3.000   Median :2.000  
##                                        Mean   :2.749   Mean   :2.522  
##                                        3rd Qu.:4.000   3rd Qu.:3.000  
##                                        Max.   :4.000   Max.   :4.000  
##      Mjob               Fjob              reason         
##  Length:395         Length:395         Length:395        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##    guardian           traveltime      studytime        failures     
##  Length:395         Min.   :1.000   Min.   :1.000   Min.   :0.0000  
##  Class :character   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0.0000  
##  Mode  :character   Median :1.000   Median :2.000   Median :0.0000  
##                     Mean   :1.448   Mean   :2.035   Mean   :0.3342  
##                     3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:0.0000  
##                     Max.   :4.000   Max.   :4.000   Max.   :3.0000  
##   schoolsup            famsup              paid          
##  Length:395         Length:395         Length:395        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##   activities          nursery             higher         
##  Length:395         Length:395         Length:395        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##    internet           romantic             famrel         freetime    
##  Length:395         Length:395         Min.   :1.000   Min.   :1.000  
##  Class :character   Class :character   1st Qu.:4.000   1st Qu.:3.000  
##  Mode  :character   Mode  :character   Median :4.000   Median :3.000  
##                                        Mean   :3.944   Mean   :3.235  
##                                        3rd Qu.:5.000   3rd Qu.:4.000  
##                                        Max.   :5.000   Max.   :5.000  
##      goout            Dalc            Walc           health     
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:3.000  
##  Median :3.000   Median :1.000   Median :2.000   Median :4.000  
##  Mean   :3.109   Mean   :1.481   Mean   :2.291   Mean   :3.554  
##  3rd Qu.:4.000   3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:5.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##     absences            G1              G2              G3       
##  Min.   : 0.000   Min.   : 3.00   Min.   : 0.00   Min.   : 0.00  
##  1st Qu.: 0.000   1st Qu.: 8.00   1st Qu.: 9.00   1st Qu.: 8.00  
##  Median : 4.000   Median :11.00   Median :11.00   Median :11.00  
##  Mean   : 5.709   Mean   :10.91   Mean   :10.71   Mean   :10.42  
##  3rd Qu.: 8.000   3rd Qu.:13.00   3rd Qu.:13.00   3rd Qu.:14.00  
##  Max.   :75.000   Max.   :19.00   Max.   :19.00   Max.   :20.00

Need to change the factor attributes to numeric for Clustering Analysis

math.num <- math %>% 
#School: 0 = GP - Gabriel Pereira,  1 = MS – Mousinho da Silveira
  mutate(school = ifelse(school == "GP", 0 ,1)) %>% 

# sex: 0 = F, 1 = M 
  mutate(sex = ifelse(sex == "F", 0, 1 )) %>% 
  
# address: 0 = U - Urban, 1 = R - Rural
  mutate(address = ifelse(address == "U", 0, 1)) %>% 
  
# famsize: 0 = LE3 - Less than or equal to 3, 1 = GT3 - Greater than 3    
  mutate(famsize = ifelse(famsize == "LE3",0, 1)) %>% 
  
# Pstatus: 0 = T - Living together, 1 = A - Living apart
  mutate(Pstatus = ifelse(Pstatus == "T", 0, 1)) %>% 

# Mjob: 0 = 'Teacher',  1 = 'Health' care related, 2 = Civil 'services', 3 = 'at_home', 4 = 'other'
  mutate(Mjob=ifelse(Mjob=="teacher",0,
                         ifelse(Mjob=="health", 1, 
                                ifelse(Mjob=="services", 2,
                                       ifelse(Mjob=="at_home", 3, 4))))) %>%

# Fjob: 0 = 'Teacher',  1 = 'Health' care related, 2 = Civil 'services', 3 = 'at_home', 4 = 'other'
  mutate(Fjob=ifelse(Fjob=="teacher",0,
                         ifelse(Fjob=="health", 1, 
                                ifelse(Fjob=="services", 2,
                                       ifelse(Fjob=="at_home", 3, 4)))))  %>%

# reason: 0 = Close to 'home', 1 = school 'reputation', 2 = 'course' preference, 3 = other
  mutate(reason=ifelse(reason=="home",0,
                         ifelse(reason=="reputation", 1, 
                                ifelse(reason=="course", 2,3))))  %>% 

# guardian: 0 = mother, 1 = father, 2 = other
  mutate(guardian=ifelse(guardian=="mother",0,
                         ifelse(guardian=="father", 1, 2))) %>%

# schoolsup: 0 = 'no', 1 = 'yes'
  mutate(schoolsup = ifelse(schoolsup == "no", 0, 1)) %>% 
  
# famsup: 0 = 'no', 1 = 'yes'
  mutate(famsup = ifelse(famsup == "no", 0, 1)) %>% 

# paid: 0 = 'no', 1 = 'yes'
  mutate(paid = ifelse(paid == "no", 0, 1)) %>% 
  
# activities: 0 = 'no', 1 = 'yes'
  mutate(activities = ifelse(activities == "no", 0, 1)) %>% 

# nursery: 0 = 'no', 1 = 'yes'
  mutate(nursery = ifelse(nursery == "no", 0 , 1)) %>% 

# higher: 0 = 'no', 1 = 'yes'
  mutate(higher = ifelse(higher == "no", 0, 1)) %>% 

# internet: 0 = 'no', 1 = 'yes'
  mutate(internet = ifelse(internet == "no", 0, 1)) %>% 

# romantic: 0 = 'no', 1 = 'yes'
  mutate(romantic = ifelse(romantic == "no", 0, 1))

Convert nominal from character to numeric.

# To use this method, the above replacement must not use gsub() -BW
math.num <- math.num %>% 
  mutate_if(is.character, funs(as.numeric))

Discretize numeric variables for Association Rule Minging

Bin the grades in G1, G2, G3 to A [16, 20], B [14, 16), C [12, 14), D[10, 12), F [0, 10)

math.assoc <- math %>% 
  mutate(G1_bin = ifelse(
    G1 < 10, "F", ifelse(
      G1 < 12, "D", ifelse(
        G1 < 14, "C", ifelse(
          G1 < 16, "B", "A"  )  )  )  )  ) %>%
  mutate(G2_bin = ifelse(
    G2 < 10, "F", ifelse(
      G2 < 12, "D", ifelse(
        G2 < 14, "C", ifelse(
          G2 < 16, "B", "A"  )  )  )  )  ) %>%
  mutate(G3_bin = ifelse(
    G3 < 10, "F", ifelse(
      G3 < 12, "D", ifelse(
        G3 < 14, "C", ifelse(
          G3 < 16, "B", "A"  )  )  )  )  )

Convert data type to nominal for performing association rule mining

math.assoc <- math.assoc %>%
  mutate_all(funs(as.factor))

#check datatype
glimpse(math.assoc)
## Observations: 395
## Variables: 36
## $ school     <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, G…
## $ sex        <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, F,…
## $ age        <fct> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 1…
## $ address    <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,…
## $ famsize    <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, GT3…
## $ Pstatus    <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, T,…
## $ Medu       <fct> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3,…
## $ Fedu       <fct> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3,…
## $ Mjob       <fct> at_home, at_home, at_home, health, other, services, o…
## $ Fjob       <fct> teacher, other, other, services, other, other, other,…
## $ reason     <fct> course, course, other, home, home, reputation, home, …
## $ guardian   <fct> mother, father, mother, mother, father, mother, mothe…
## $ traveltime <fct> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3,…
## $ studytime  <fct> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2,…
## $ failures   <fct> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ schoolsup  <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, no…
## $ famsup     <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes, y…
## $ paid       <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, no…
## $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes, y…
## $ nursery    <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, yes,…
## $ higher     <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes…
## $ internet   <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes, y…
## $ romantic   <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no, …
## $ famrel     <fct> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5,…
## $ freetime   <fct> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3,…
## $ goout      <fct> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2,…
## $ Dalc       <fct> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ Walc       <fct> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1,…
## $ health     <fct> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4,…
## $ absences   <fct> 6, 4, 10, 2, 4, 10, 0, 6, 0, 0, 0, 4, 2, 2, 0, 4, 6, …
## $ G1         <fct> 5, 5, 7, 15, 6, 15, 12, 6, 16, 14, 10, 10, 14, 10, 14…
## $ G2         <fct> 6, 5, 8, 14, 10, 15, 12, 5, 18, 15, 8, 12, 14, 10, 16…
## $ G3         <fct> 6, 6, 10, 15, 10, 15, 11, 6, 19, 15, 9, 12, 14, 11, 1…
## $ G1_bin     <fct> F, F, F, B, F, B, C, F, A, B, D, D, B, D, B, B, C, F,…
## $ G2_bin     <fct> F, F, F, B, D, B, C, F, A, B, F, C, B, D, A, B, B, D,…
## $ G3_bin     <fct> F, F, D, B, D, B, D, F, A, B, F, C, B, D, A, B, B, D,…

Exploratory Analysis

Visualization

# Correlation between G1 and G2
math %>% 
  ggplot(aes(x=G1, y=G2)) +
  geom_jitter(aes(col=G3)) +
  scale_color_continuous(low = "red", high = "blue") +
  xlab("Score on G1") +
  ylab("Score on G2") +
  ggtitle("Plot 0. Scores throughout the year")

# How many students passed (10 - 20) and how many failed? (0 - 9)
math %>% 
  mutate(pass = ifelse(G3 > 9, "pass", "fail")) %>% 
  ggplot(aes(x = pass, fill = pass)) +
  geom_bar() +
  xlab("Result at End of Term") +
  ylab("Number of Students") +
  annotate(geom="text", label = "265", x = 2, y = 250) +
  annotate(geom="text", label = "130", x = 1, y = 115) +
  annotate(geom="text", label = "67.1%", x = 2, y = 235) +
  annotate(geom="text", label = "32.9%", x = 1, y = 100) +
  ggtitle("Plot 1. Students Pass Rate") +
  theme_light()

# School
math.assoc %>%
  ggplot(aes(x=address)) +
  geom_bar(aes(fill=reason)) +
  geom_text(stat = "count", aes(label=stat(count)), vjust=-1) +
  ggtitle("Plot 2. School comparison") +
  facet_wrap(~school) +
  ylim(0,330) +
  theme_light() + 
  geom_text(data=math.assoc %>% group_by(school) %>% tally(), 
            aes(x=1.5, y=330, label = paste0( round(n/sum(n), digits = 2)*100, "%")), 
              vjust = 1.5, size = 5)

# Parent job, education vs grade
#- Mother
math.assoc %>%
  ggplot(aes(x=Mjob, y=G3)) + 
  geom_jitter(aes(col=Medu)) + 
  ggtitle("Plot 3. Mother's education and mother's job", subtitle = "Male vs Female") +
  facet_grid(~sex) +
  ylab("Final Grade") +
  xlab("Job") +
  geom_hline(yintercept = 7) +
  annotate(geom="text", label = "Pass", x = 0.8, y = 7.5) +
  annotate(geom="text", label = "Fail", x = 0.8, y = 6) +
  theme_light()

#- Father
math.assoc %>%
  ggplot(aes(x=Fjob, y=G3)) +
  geom_jitter(aes(col=Fedu)) +
  ggtitle("Plot 4. Father's education and father's job", subtitle = "Male vs Female") +
  facet_wrap(~sex) +
  ylab("Final Grade") +
  xlab("Job") +
  geom_hline(yintercept = 7) +
  annotate(geom="text", label = "Pass", x = 0.8, y = 7.5) +
  annotate(geom="text", label = "Fail", x = 0.8, y = 6) +
  theme_light()

# Failures, school support vs grade
math.assoc %>%
  ggplot(aes(x=failures, y=G3)) +
  geom_jitter(aes(col=sex)) +
  ggtitle("Plot 5. Failures vs Final Grade", subtitle = "School educational support") +
  facet_wrap(~schoolsup) +
  ylab("Final grade") +
  xlab("Number of failures") +
  theme_light()

# Family size, Family support, parent status, family relationship vs grade
math.assoc %>%
  ggplot(aes(x=famsize, y=G3)) +
  geom_jitter(aes(col=famrel, shape=Pstatus)) +
  ggtitle("Plot 6. Family Size vs Final Grade", subtitle = "Family educational support") +
  facet_wrap(~famsup) +
  ylab("Final grade") +
  xlab("Family size") +
  theme_light()

# Goout, freetime, studytime, Walc, Dalc
math.assoc %>%
  ggplot(aes(x=studytime, y=G3)) +
  geom_jitter(aes(col=goout)) +
  ggtitle("Plot 7. Study time vs Final Grade", subtitle = "Free time") +
  facet_grid(~freetime) +
  ylab("Final grade") +
  xlab("Study time") +
  scale_color_discrete(name="Go out") +
  theme_light()

# Absences, activities, vs grade
math.assoc %>%
  ggplot(aes(x=absences, y=G3)) +
  geom_jitter(aes(col=activities)) +
  ggtitle("Plot 8. Absences vs Final Grade") +
  ylab("Final grade") +
  scale_color_discrete(name="Extra-curricular activities") +
  theme_light()

  • Observation:
  • Plot 1:
    • Students from the data set have 67.1% pass rate and 32.9% fail rate
  • Plot 2:
    • School variable could be bias since GP has more students than MS
  • Plot 3 and 4:
    • People who works as a teacher or work in health care field tend to have higher level of education
    • Male with mother working as a teacher tend to get high grade on final grade
    • Male with father working as a teacher tend to get high grade on final grade
  • Plot 5:
    • With school extra education support, it can lower down the number of students getting zero
  • Plot 6:
    • People whose family are apart and have high level of family relationship tend to perform better on final grade
    • Overall, small family has better performance than huge family
  • Plot 7:
    • Students who have middle level of free time and going out tend to get better performance
    • Study more than 10 hours (level 4) does not really have huge impact on the final grade
    • Students who go out a lot have the potential risk of failing on the final grade
  • Plot 8:
    • Having Extra-curricular activities with lower number of absences tend to help student get better performance

Data Modeling

Association rule mining

Create rules from high level

# Remove grade variables from data set
association <- math.assoc %>%
  select(-c(G1,G2,G3))
# Run apriori with setting, sup = 0.05, conf = 0.95, maxlen = 3 
rules <-  apriori(data = association, parameter = list(sup = 0.2, conf = 0.95, maxlen = 3), 
                  control = list(verbose=F))
# lookup the summary of rule
summary(rules)
  • Observation:
  • with the parameter setting, support = 0.05 and confident = 0.95,it has 461 rules
  • Rule length = 3 has 418 rules
  • Rule length = 2 has 43 rules
  • Maximum count = 332
  • Mximum cofidence = 1
  • Maximum support = 0.84
  • Maximum lift = 2.601

Inspect rules from high level

# Sort by confidence and lift to see most relevant rules
rules <- sort(rules, by = c("confidence", "lift", "support"), decreasing = TRUE)

# Redundant rules
inspect(rules[is.redundant(rules)][1:20])
##      lhs                         rhs          support   confidence
## [1]  {age=16,higher=yes}      => {school=GP}  0.2556962 1.0000000 
## [2]  {age=16,Pstatus=T}       => {school=GP}  0.2405063 1.0000000 
## [3]  {age=16,address=U}       => {school=GP}  0.2354430 1.0000000 
## [4]  {age=16,failures=0}      => {school=GP}  0.2227848 1.0000000 
## [5]  {age=16,internet=yes}    => {school=GP}  0.2227848 1.0000000 
## [6]  {age=16,schoolsup=no}    => {school=GP}  0.2177215 1.0000000 
## [7]  {age=16,nursery=yes}     => {school=GP}  0.2126582 1.0000000 
## [8]  {age=15,higher=yes}      => {school=GP}  0.2075949 1.0000000 
## [9]  {age=16,guardian=mother} => {school=GP}  0.2000000 1.0000000 
## [10] {school=GP,age=15}       => {higher=yes} 0.2075949 1.0000000 
## [11] {Pstatus=T,paid=yes}     => {higher=yes} 0.4151899 0.9939394 
## [12] {paid=yes,internet=yes}  => {higher=yes} 0.4075949 0.9938272 
## [13] {failures=0,paid=yes}    => {higher=yes} 0.4000000 0.9937107 
## [14] {schoolsup=no,paid=yes}  => {higher=yes} 0.4000000 0.9937107 
## [15] {paid=yes,nursery=yes}   => {higher=yes} 0.3822785 0.9934211 
## [16] {school=GP,Walc=1}       => {Dalc=1}     0.3544304 0.9929078 
## [17] {famsup=yes,paid=yes}    => {higher=yes} 0.3493671 0.9928058 
## [18] {Pstatus=T,Walc=1}       => {Dalc=1}     0.3291139 0.9923664 
## [19] {failures=0,Walc=1}      => {Dalc=1}     0.3189873 0.9921260 
## [20] {address=U,Walc=1}       => {Dalc=1}     0.3164557 0.9920635 
##      lift     count
## [1]  1.131805 101  
## [2]  1.131805  95  
## [3]  1.131805  93  
## [4]  1.131805  88  
## [5]  1.131805  88  
## [6]  1.131805  86  
## [7]  1.131805  84  
## [8]  1.131805  82  
## [9]  1.131805  79  
## [10] 1.053333  82  
## [11] 1.046949 164  
## [12] 1.046831 161  
## [13] 1.046709 158  
## [14] 1.046709 158  
## [15] 1.046404 151  
## [16] 1.421009 140  
## [17] 1.045755 138  
## [18] 1.420235 130  
## [19] 1.419890 126  
## [20] 1.419801 125
# View the non-redundant rules
inspect(rules[!is.redundant(rules)][1:20])
##      lhs                              rhs          support   confidence
## [1]  {higher=yes,Walc=1}           => {Dalc=1}     0.3645570 1         
## [2]  {romantic=no,Walc=1}          => {Dalc=1}     0.2531646 1         
## [3]  {sex=F,Walc=1}                => {Dalc=1}     0.2379747 1         
## [4]  {activities=yes,Walc=1}       => {Dalc=1}     0.2177215 1         
## [5]  {Fjob=other,Walc=1}           => {Dalc=1}     0.2000000 1         
## [6]  {age=16}                      => {school=GP}  0.2632911 1         
## [7]  {age=15}                      => {school=GP}  0.2075949 1         
## [8]  {school=GP,paid=yes}          => {higher=yes} 0.4075949 1         
## [9]  {address=U,paid=yes}          => {higher=yes} 0.3670886 1         
## [10] {guardian=mother,paid=yes}    => {higher=yes} 0.3316456 1         
## [11] {traveltime=1,activities=yes} => {higher=yes} 0.3316456 1         
## [12] {famsize=GT3,paid=yes}        => {higher=yes} 0.3291139 1         
## [13] {paid=yes,Dalc=1}             => {higher=yes} 0.3088608 1         
## [14] {paid=yes,romantic=no}        => {higher=yes} 0.3037975 1         
## [15] {traveltime=1,paid=yes}       => {higher=yes} 0.3012658 1         
## [16] {sex=F,paid=yes}              => {higher=yes} 0.2734177 1         
## [17] {Fjob=other,paid=yes}         => {higher=yes} 0.2556962 1         
## [18] {studytime=2,paid=yes}        => {higher=yes} 0.2506329 1         
## [19] {sex=F,activities=yes}        => {higher=yes} 0.2430380 1         
## [20] {Medu=4,Dalc=1}               => {higher=yes} 0.2303797 1         
##      lift     count
## [1]  1.431159 144  
## [2]  1.431159 100  
## [3]  1.431159  94  
## [4]  1.431159  86  
## [5]  1.431159  79  
## [6]  1.131805 104  
## [7]  1.131805  82  
## [8]  1.053333 161  
## [9]  1.053333 145  
## [10] 1.053333 131  
## [11] 1.053333 131  
## [12] 1.053333 130  
## [13] 1.053333 122  
## [14] 1.053333 120  
## [15] 1.053333 119  
## [16] 1.053333 108  
## [17] 1.053333 101  
## [18] 1.053333  99  
## [19] 1.053333  96  
## [20] 1.053333  91
# plot the rules by scatterplot
plot(rules, measure = c("support", "lift"), shading = "confidence")

# visualize the grouped matrix on the first 20 rules
plot(rules[1:20], method = "grouped")

# plot the first 20 rules
plot(rules[1:20], method="graph", interactive=FALSE, shading=NA)

# Parallel coordinates plot on the first 20 rules
plot(rules[1:20], method="paracoord", reorder=TRUE)

  • Observation:
  • Rules with shool=GP appears many times
    • It may be bias because the uneven samples on two schools
    • Need to plot a school comparison for validation
  • Rule with higher education appears many times
    • Students at age 15 want to get higher education
    • Students whose mother has higher education level tend to get higher education as well
    • Female wants to get higher education

Relationship with G3

Rules with grade A, sup = 0.01, conf = 0.7, maxlen = 4

rules.G3.A <- 
  apriori(data = association[,-(31:32)], parameter = list(sup = 0.01, conf = 0.7, maxlen = 4), 
          appearance = list(default="lhs",rhs=c("G3_bin=A")),
          control = list(verbose=F))
summary(rules.G3.A)
## set of 30 rules
## 
## rule length distribution (lhs + rhs):sizes
##  3  4 
##  1 29 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   4.000   4.000   3.967   4.000   4.000 
## 
## summary of quality measures:
##     support          confidence          lift           count      
##  Min.   :0.01013   Min.   :0.7143   Min.   :7.054   Min.   :4.000  
##  1st Qu.:0.01013   1st Qu.:0.7143   1st Qu.:7.054   1st Qu.:4.000  
##  Median :0.01266   Median :0.8000   Median :7.900   Median :5.000  
##  Mean   :0.01148   Mean   :0.7824   Mean   :7.726   Mean   :4.533  
##  3rd Qu.:0.01266   3rd Qu.:0.8000   3rd Qu.:7.900   3rd Qu.:5.000  
##  Max.   :0.01266   Max.   :1.0000   Max.   :9.875   Max.   :5.000  
## 
## mining info:
##                     data ntransactions support confidence
##  association[, -(31:32)]           395    0.01        0.7
rules.G3.A <- sort(rules.G3.A, by = c("confidence", "lift", "support"), decreasing = TRUE)
inspect(rules.G3.A[!is.redundant(rules.G3.A)])
##      lhs                 rhs           support confidence     lift count
## [1]  {Mjob=teacher,                                                     
##       reason=course,                                                    
##       freetime=5}     => {G3_bin=A} 0.01265823  1.0000000 9.875000     5
## [2]  {Mjob=teacher,                                                     
##       famsup=no,                                                        
##       freetime=5}     => {G3_bin=A} 0.01012658  1.0000000 9.875000     4
## [3]  {sex=M,                                                            
##       Mjob=teacher,                                                     
##       freetime=5}     => {G3_bin=A} 0.01265823  0.8333333 8.229167     5
## [4]  {Mjob=teacher,                                                     
##       schoolsup=no,                                                     
##       freetime=5}     => {G3_bin=A} 0.01265823  0.8333333 8.229167     5
## [5]  {Medu=4,                                                           
##       reason=course,                                                    
##       freetime=5}     => {G3_bin=A} 0.01265823  0.8333333 8.229167     5
## [6]  {Medu=4,                                                           
##       Mjob=services,                                                    
##       Fjob=teacher}   => {G3_bin=A} 0.01012658  0.8000000 7.900000     4
## [7]  {Mjob=services,                                                    
##       Fjob=teacher,                                                     
##       failures=0}     => {G3_bin=A} 0.01012658  0.8000000 7.900000     4
## [8]  {Fjob=teacher,                                                     
##       reason=course,                                                    
##       activities=yes} => {G3_bin=A} 0.01012658  0.8000000 7.900000     4
## [9]  {Fjob=teacher,                                                     
##       traveltime=1,                                                     
##       famsup=no}      => {G3_bin=A} 0.01012658  0.8000000 7.900000     4
## [10] {Mjob=teacher,                                                     
##       activities=yes,                                                   
##       freetime=5}     => {G3_bin=A} 0.01012658  0.8000000 7.900000     4
## [11] {Mjob=teacher,                                                     
##       paid=no,                                                          
##       freetime=5}     => {G3_bin=A} 0.01012658  0.8000000 7.900000     4
## [12] {Mjob=teacher,                                                     
##       romantic=no,                                                      
##       freetime=5}     => {G3_bin=A} 0.01012658  0.8000000 7.900000     4
## [13] {age=15,                                                           
##       Medu=4,                                                           
##       freetime=5}     => {G3_bin=A} 0.01012658  0.8000000 7.900000     4
## [14] {sex=M,                                                            
##       Pstatus=A,                                                        
##       absences=0}     => {G3_bin=A} 0.01012658  0.8000000 7.900000     4
## [15] {famsize=LE3,                                                      
##       studytime=3,                                                      
##       famrel=5}       => {G3_bin=A} 0.01012658  0.8000000 7.900000     4
## [16] {famsize=LE3,                                                      
##       studytime=3,                                                      
##       Walc=1}         => {G3_bin=A} 0.01012658  0.8000000 7.900000     4
## [17] {famsize=LE3,                                                      
##       studytime=3,                                                      
##       Dalc=1}         => {G3_bin=A} 0.01012658  0.8000000 7.900000     4
## [18] {Medu=4,                                                           
##       famrel=5,                                                         
##       health=4}       => {G3_bin=A} 0.01012658  0.8000000 7.900000     4
## [19] {Mjob=teacher,                                                     
##       freetime=5}     => {G3_bin=A} 0.01265823  0.7142857 7.053571     5
## [20] {Mjob=services,                                                    
##       Fjob=teacher,                                                     
##       paid=no}        => {G3_bin=A} 0.01265823  0.7142857 7.053571     5
## [21] {address=U,                                                        
##       Mjob=services,                                                    
##       Fjob=teacher}   => {G3_bin=A} 0.01265823  0.7142857 7.053571     5
## [22] {Mjob=services,                                                    
##       Fjob=teacher,                                                     
##       nursery=yes}    => {G3_bin=A} 0.01265823  0.7142857 7.053571     5
# visualize the grouped matrix on the first 20 rules
plot(rules.G3.A[1:20], method = "grouped")

  • Observation: (22 rules)
  • Mother is a teacher; reason choosing school is because of course; have free time after school
  • Mother is a teacher; don’t have family educational support; have extra-curricular activities
  • Mother has higher education level; reason choosing school is because of course; have free time after school
  • Male students; mother is a teacher; have free time after school
  • Father is a teacher, reason choosing school is because of course; have extra-curricular activities

Rules with grade B

rules.G3.B <- 
  apriori(data = association[,-(31:32)], parameter = list(sup = 0.01, conf = 0.8, maxlen = 4), 
          appearance = list(default="lhs",rhs=c("G3_bin=B")),
          control = list(verbose=F))
summary(rules.G3.B)
## set of 26 rules
## 
## rule length distribution (lhs + rhs):sizes
##  3  4 
##  1 25 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   4.000   4.000   3.962   4.000   4.000 
## 
## summary of quality measures:
##     support          confidence          lift           count      
##  Min.   :0.01013   Min.   :0.8000   Min.   :5.267   Min.   :4.000  
##  1st Qu.:0.01013   1st Qu.:0.8000   1st Qu.:5.267   1st Qu.:4.000  
##  Median :0.01013   Median :0.8000   Median :5.267   Median :4.000  
##  Mean   :0.01052   Mean   :0.8205   Mean   :5.402   Mean   :4.154  
##  3rd Qu.:0.01013   3rd Qu.:0.8000   3rd Qu.:5.267   3rd Qu.:4.000  
##  Max.   :0.01266   Max.   :1.0000   Max.   :6.583   Max.   :5.000  
## 
## mining info:
##                     data ntransactions support confidence
##  association[, -(31:32)]           395    0.01        0.8
rules.G3.B <- sort(rules.G3.B, by = c("confidence", "lift", "support"), decreasing = TRUE)
inspect(rules.G3.B[!is.redundant(rules.G3.B)])
##      lhs                 rhs           support confidence     lift count
## [1]  {famsize=GT3,                                                      
##       Mjob=health,                                                      
##       studytime=3}    => {G3_bin=B} 0.01012658  1.0000000 6.583333     4
## [2]  {Medu=4,                                                           
##       Fedu=2,                                                           
##       studytime=3}    => {G3_bin=B} 0.01012658  1.0000000 6.583333     4
## [3]  {activities=yes,                                                   
##       goout=4,                                                          
##       absences=6}     => {G3_bin=B} 0.01265823  0.8333333 5.486111     5
## [4]  {studytime=3,                                                      
##       paid=no,                                                          
##       Walc=3}         => {G3_bin=B} 0.01265823  0.8333333 5.486111     5
## [5]  {Medu=3,                                                           
##       famsup=no,                                                        
##       Walc=3}         => {G3_bin=B} 0.01265823  0.8333333 5.486111     5
## [6]  {Fedu=4,                                                           
##       nursery=no,                                                       
##       Dalc=1}         => {G3_bin=B} 0.01265823  0.8333333 5.486111     5
## [7]  {Mjob=health,                                                      
##       studytime=3}    => {G3_bin=B} 0.01012658  0.8000000 5.266667     4
## [8]  {Medu=2,                                                           
##       goout=4,                                                          
##       absences=6}     => {G3_bin=B} 0.01012658  0.8000000 5.266667     4
## [9]  {sex=M,                                                            
##       goout=4,                                                          
##       absences=6}     => {G3_bin=B} 0.01012658  0.8000000 5.266667     4
## [10] {Mjob=services,                                                    
##       traveltime=1,                                                     
##       absences=6}     => {G3_bin=B} 0.01012658  0.8000000 5.266667     4
## [11] {sex=M,                                                            
##       Medu=2,                                                           
##       absences=6}     => {G3_bin=B} 0.01012658  0.8000000 5.266667     4
## [12] {Medu=4,                                                           
##       Mjob=health,                                                      
##       health=3}       => {G3_bin=B} 0.01012658  0.8000000 5.266667     4
## [13] {Medu=4,                                                           
##       activities=no,                                                    
##       absences=4}     => {G3_bin=B} 0.01012658  0.8000000 5.266667     4
## [14] {internet=no,                                                      
##       Walc=2,                                                           
##       absences=2}     => {G3_bin=B} 0.01012658  0.8000000 5.266667     4
## [15] {age=17,                                                           
##       studytime=3,                                                      
##       Walc=3}         => {G3_bin=B} 0.01012658  0.8000000 5.266667     4
## [16] {sex=M,                                                            
##       internet=no,                                                      
##       Walc=2}         => {G3_bin=B} 0.01012658  0.8000000 5.266667     4
  • Observations: (16 rules)
  • Nothing of note

Rules with grade C

rules.G3.C <- 
  apriori(data = association[,-(31:32)], parameter = list(sup = 0.01, conf = 0.8, maxlen = 4), 
          appearance = list(default="lhs",rhs=c("G3_bin=C")),
          control = list(verbose=F))
summary(rules.G3.C)
## set of 36 rules
## 
## rule length distribution (lhs + rhs):sizes
##  3  4 
##  1 35 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   4.000   4.000   3.972   4.000   4.000 
## 
## summary of quality measures:
##     support          confidence          lift           count      
##  Min.   :0.01013   Min.   :0.8000   Min.   :5.097   Min.   :4.000  
##  1st Qu.:0.01013   1st Qu.:0.8000   1st Qu.:5.097   1st Qu.:4.000  
##  Median :0.01013   Median :0.8000   Median :5.097   Median :4.000  
##  Mean   :0.01027   Mean   :0.8574   Mean   :5.463   Mean   :4.056  
##  3rd Qu.:0.01013   3rd Qu.:1.0000   3rd Qu.:6.371   3rd Qu.:4.000  
##  Max.   :0.01266   Max.   :1.0000   Max.   :6.371   Max.   :5.000  
## 
## mining info:
##                     data ntransactions support confidence
##  association[, -(31:32)]           395    0.01        0.8
rules.G3.C <- sort(rules.G3.C, by = c("confidence", "lift", "support"), decreasing = TRUE)
inspect(rules.G3.C[!is.redundant(rules.G3.C)])
##      lhs                    rhs           support confidence     lift count
## [1]  {Mjob=other,                                                          
##       Dalc=5}            => {G3_bin=C} 0.01012658  1.0000000 6.370968     4
## [2]  {famsize=GT3,                                                         
##       failures=0,                                                          
##       famrel=1}          => {G3_bin=C} 0.01012658  1.0000000 6.370968     4
## [3]  {school=GP,                                                           
##       Fjob=at_home,                                                        
##       freetime=4}        => {G3_bin=C} 0.01012658  1.0000000 6.370968     4
## [4]  {Mjob=other,                                                          
##       famsup=yes,                                                          
##       Walc=5}            => {G3_bin=C} 0.01012658  1.0000000 6.370968     4
## [5]  {Medu=3,                                                              
##       failures=1,                                                          
##       freetime=5}        => {G3_bin=C} 0.01012658  1.0000000 6.370968     4
## [6]  {Mjob=other,                                                          
##       higher=yes,                                                          
##       Walc=5}            => {G3_bin=C} 0.01265823  0.8333333 5.309140     5
## [7]  {address=R,                                                           
##       famsize=LE3,                                                         
##       Medu=1}            => {G3_bin=C} 0.01265823  0.8333333 5.309140     5
## [8]  {famsize=GT3,                                                         
##       internet=yes,                                                        
##       famrel=1}          => {G3_bin=C} 0.01012658  0.8000000 5.096774     4
## [9]  {school=GP,                                                           
##       famsize=GT3,                                                         
##       famrel=1}          => {G3_bin=C} 0.01012658  0.8000000 5.096774     4
## [10] {failures=0,                                                          
##       internet=yes,                                                        
##       famrel=1}          => {G3_bin=C} 0.01012658  0.8000000 5.096774     4
## [11] {school=GP,                                                           
##       failures=0,                                                          
##       famrel=1}          => {G3_bin=C} 0.01012658  0.8000000 5.096774     4
## [12] {Fjob=at_home,                                                        
##       traveltime=1,                                                        
##       freetime=4}        => {G3_bin=C} 0.01012658  0.8000000 5.096774     4
## [13] {Mjob=other,                                                          
##       Fjob=other,                                                          
##       Walc=5}            => {G3_bin=C} 0.01012658  0.8000000 5.096774     4
## [14] {Mjob=other,                                                          
##       failures=0,                                                          
##       Walc=5}            => {G3_bin=C} 0.01012658  0.8000000 5.096774     4
## [15] {Mjob=health,                                                         
##       failures=0,                                                          
##       Walc=4}            => {G3_bin=C} 0.01012658  0.8000000 5.096774     4
## [16] {Mjob=health,                                                         
##       internet=yes,                                                        
##       Walc=4}            => {G3_bin=C} 0.01012658  0.8000000 5.096774     4
## [17] {Pstatus=T,                                                           
##       Mjob=health,                                                         
##       Walc=4}            => {G3_bin=C} 0.01012658  0.8000000 5.096774     4
## [18] {famsize=LE3,                                                         
##       reason=other,                                                        
##       famrel=4}          => {G3_bin=C} 0.01012658  0.8000000 5.096774     4
## [19] {Medu=3,                                                              
##       Fjob=services,                                                       
##       freetime=5}        => {G3_bin=C} 0.01012658  0.8000000 5.096774     4
## [20] {Medu=3,                                                              
##       activities=no,                                                       
##       freetime=5}        => {G3_bin=C} 0.01012658  0.8000000 5.096774     4
## [21] {reason=course,                                                       
##       traveltime=2,                                                        
##       freetime=5}        => {G3_bin=C} 0.01012658  0.8000000 5.096774     4
## [22] {address=R,                                                           
##       famsize=LE3,                                                         
##       Mjob=at_home}      => {G3_bin=C} 0.01012658  0.8000000 5.096774     4
## [23] {Medu=2,                                                              
##       guardian=father,                                                     
##       freetime=2}        => {G3_bin=C} 0.01012658  0.8000000 5.096774     4
## [24] {Mjob=other,                                                          
##       traveltime=2,                                                        
##       freetime=2}        => {G3_bin=C} 0.01012658  0.8000000 5.096774     4
## [25] {Dalc=2,                                                              
##       Walc=3,                                                              
##       absences=2}        => {G3_bin=C} 0.01012658  0.8000000 5.096774     4
## [26] {address=R,                                                           
##       reason=reputation,                                                   
##       internet=no}       => {G3_bin=C} 0.01012658  0.8000000 5.096774     4
## [27] {studytime=1,                                                         
##       famrel=3,                                                            
##       freetime=4}        => {G3_bin=C} 0.01012658  0.8000000 5.096774     4
## [28] {Medu=2,                                                              
##       reason=course,                                                       
##       Walc=3}            => {G3_bin=C} 0.01012658  0.8000000 5.096774     4
## [29] {sex=M,                                                               
##       Medu=3,                                                              
##       Walc=2}            => {G3_bin=C} 0.01012658  0.8000000 5.096774     4
## [30] {address=R,                                                           
##       Fjob=services,                                                       
##       guardian=father}   => {G3_bin=C} 0.01012658  0.8000000 5.096774     4
## [31] {Mjob=services,                                                       
##       goout=3,                                                             
##       health=3}          => {G3_bin=C} 0.01012658  0.8000000 5.096774     4
  • Observation: (31 rules)
  • Mother’s job is other; workday alcohol consumption level is extreme high
  • Familiy size is greater than 3; family relationship is extremely low; no class failure
  • Mother’s job is other; weekend alcohol consumption level is extreme high; have family educational support

Rules with grade D

rules.G3.D <- 
  apriori(data = association[,-(31:32)], parameter = list(sup = 0.01, conf = 0.95, maxlen = 4), 
          appearance = list(default="lhs",rhs=c("G3_bin=D")),
          control = list(verbose=F))
summary(rules.G3.D)
## set of 40 rules
## 
## rule length distribution (lhs + rhs):sizes
##  4 
## 40 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       4       4       4       4       4       4 
## 
## summary of quality measures:
##     support          confidence      lift           count      
##  Min.   :0.01013   Min.   :1    Min.   :3.835   Min.   :4.000  
##  1st Qu.:0.01013   1st Qu.:1    1st Qu.:3.835   1st Qu.:4.000  
##  Median :0.01013   Median :1    Median :3.835   Median :4.000  
##  Mean   :0.01057   Mean   :1    Mean   :3.835   Mean   :4.175  
##  3rd Qu.:0.01013   3rd Qu.:1    3rd Qu.:3.835   3rd Qu.:4.000  
##  Max.   :0.01519   Max.   :1    Max.   :3.835   Max.   :6.000  
## 
## mining info:
##                     data ntransactions support confidence
##  association[, -(31:32)]           395    0.01       0.95
rules.G3.D <- sort(rules.G3.D, by = c("confidence", "lift", "support"), decreasing = TRUE)
inspect(rules.G3.D[!is.redundant(rules.G3.D)])
##      lhs                    rhs           support confidence     lift count
## [1]  {Medu=3,                                                              
##       schoolsup=yes,                                                       
##       activities=yes}    => {G3_bin=D} 0.01518987          1 3.834951     6
## [2]  {Medu=3,                                                              
##       Fedu=3,                                                              
##       schoolsup=yes}     => {G3_bin=D} 0.01265823          1 3.834951     5
## [3]  {address=R,                                                           
##       guardian=mother,                                                     
##       absences=4}        => {G3_bin=D} 0.01265823          1 3.834951     5
## [4]  {Fedu=1,                                                              
##       reason=home,                                                         
##       absences=2}        => {G3_bin=D} 0.01265823          1 3.834951     5
## [5]  {Fedu=1,                                                              
##       paid=yes,                                                            
##       absences=2}        => {G3_bin=D} 0.01265823          1 3.834951     5
## [6]  {Fedu=2,                                                              
##       reason=reputation,                                                   
##       romantic=yes}      => {G3_bin=D} 0.01265823          1 3.834951     5
## [7]  {traveltime=4,                                                        
##       paid=no,                                                             
##       internet=yes}      => {G3_bin=D} 0.01012658          1 3.834951     4
## [8]  {age=17,                                                              
##       romantic=no,                                                         
##       goout=1}           => {G3_bin=D} 0.01012658          1 3.834951     4
## [9]  {Mjob=services,                                                       
##       studytime=1,                                                         
##       goout=1}           => {G3_bin=D} 0.01012658          1 3.834951     4
## [10] {sex=M,                                                               
##       studytime=1,                                                         
##       goout=1}           => {G3_bin=D} 0.01012658          1 3.834951     4
## [11] {famsize=LE3,                                                         
##       activities=no,                                                       
##       goout=1}           => {G3_bin=D} 0.01012658          1 3.834951     4
## [12] {studytime=4,                                                         
##       famsup=no,                                                           
##       goout=3}           => {G3_bin=D} 0.01012658          1 3.834951     4
## [13] {famrel=5,                                                            
##       goout=5,                                                             
##       absences=6}        => {G3_bin=D} 0.01012658          1 3.834951     4
## [14] {school=MS,                                                           
##       famsize=GT3,                                                         
##       Mjob=at_home}      => {G3_bin=D} 0.01012658          1 3.834951     4
## [15] {school=MS,                                                           
##       Dalc=1,                                                              
##       absences=2}        => {G3_bin=D} 0.01012658          1 3.834951     4
## [16] {reason=reputation,                                                   
##       failures=1,                                                          
##       goout=2}           => {G3_bin=D} 0.01012658          1 3.834951     4
## [17] {schoolsup=yes,                                                       
##       famrel=5,                                                            
##       absences=4}        => {G3_bin=D} 0.01012658          1 3.834951     4
## [18] {Medu=3,                                                              
##       schoolsup=yes,                                                       
##       goout=2}           => {G3_bin=D} 0.01012658          1 3.834951     4
## [19] {Medu=3,                                                              
##       reason=reputation,                                                   
##       schoolsup=yes}     => {G3_bin=D} 0.01012658          1 3.834951     4
## [20] {Medu=1,                                                              
##       guardian=mother,                                                     
##       absences=4}        => {G3_bin=D} 0.01012658          1 3.834951     4
## [21] {age=17,                                                              
##       Walc=3,                                                              
##       absences=4}        => {G3_bin=D} 0.01012658          1 3.834951     4
## [22] {Mjob=other,                                                          
##       Walc=2,                                                              
##       absences=4}        => {G3_bin=D} 0.01012658          1 3.834951     4
## [23] {freetime=3,                                                          
##       goout=2,                                                             
##       absences=4}        => {G3_bin=D} 0.01012658          1 3.834951     4
## [24] {Mjob=at_home,                                                        
##       freetime=3,                                                          
##       absences=2}        => {G3_bin=D} 0.01012658          1 3.834951     4
## [25] {Mjob=at_home,                                                        
##       reason=reputation,                                                   
##       Walc=3}            => {G3_bin=D} 0.01012658          1 3.834951     4
## [26] {Mjob=at_home,                                                        
##       Dalc=1,                                                              
##       Walc=3}            => {G3_bin=D} 0.01012658          1 3.834951     4
## [27] {famsize=GT3,                                                         
##       Mjob=at_home,                                                        
##       Walc=3}            => {G3_bin=D} 0.01012658          1 3.834951     4
## [28] {Mjob=at_home,                                                        
##       reason=reputation,                                                   
##       goout=3}           => {G3_bin=D} 0.01012658          1 3.834951     4
## [29] {Medu=1,                                                              
##       studytime=2,                                                         
##       health=4}          => {G3_bin=D} 0.01012658          1 3.834951     4
## [30] {age=18,                                                              
##       Medu=1,                                                              
##       reason=home}       => {G3_bin=D} 0.01012658          1 3.834951     4
## [31] {age=15,                                                              
##       Medu=1,                                                              
##       paid=yes}          => {G3_bin=D} 0.01012658          1 3.834951     4
## [32] {Medu=1,                                                              
##       reason=course,                                                       
##       paid=yes}          => {G3_bin=D} 0.01012658          1 3.834951     4
## [33] {sex=F,                                                               
##       age=18,                                                              
##       absences=2}        => {G3_bin=D} 0.01012658          1 3.834951     4
## [34] {age=17,                                                              
##       Walc=1,                                                              
##       absences=2}        => {G3_bin=D} 0.01012658          1 3.834951     4
## [35] {reason=home,                                                         
##       traveltime=2,                                                        
##       absences=2}        => {G3_bin=D} 0.01012658          1 3.834951     4
## [36] {age=16,                                                              
##       paid=yes,                                                            
##       internet=no}       => {G3_bin=D} 0.01012658          1 3.834951     4
## [37] {age=15,                                                              
##       activities=yes,                                                      
##       Dalc=2}            => {G3_bin=D} 0.01012658          1 3.834951     4
## [38] {age=16,                                                              
##       reason=reputation,                                                   
##       Walc=3}            => {G3_bin=D} 0.01012658          1 3.834951     4
## [39] {address=R,                                                           
##       Fedu=3,                                                              
##       health=3}          => {G3_bin=D} 0.01012658          1 3.834951     4
## [40] {address=R,                                                           
##       Mjob=services,                                                       
##       Fjob=services}     => {G3_bin=D} 0.01012658          1 3.834951     4
  • Observations : (40 rules)
  • Nothing of note

Rules with grade F

rules.G3.F <- 
  apriori(data = association[,-(31:32)], parameter = list(sup = 0.03, conf = 0.8, maxlen = 3), 
          appearance = list(default="lhs",rhs=c("G3_bin=F")),
          control = list(verbose=F))
summary(rules.G3.F)
## set of 8 rules
## 
## rule length distribution (lhs + rhs):sizes
## 2 3 
## 1 7 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   3.000   2.875   3.000   3.000 
## 
## summary of quality measures:
##     support          confidence          lift           count      
##  Min.   :0.03038   Min.   :0.8000   Min.   :2.431   Min.   :12.00  
##  1st Qu.:0.03038   1st Qu.:0.8000   1st Qu.:2.431   1st Qu.:12.00  
##  Median :0.03038   Median :0.8180   Median :2.486   Median :12.00  
##  Mean   :0.03133   Mean   :0.8259   Mean   :2.510   Mean   :12.38  
##  3rd Qu.:0.03101   3rd Qu.:0.8571   3rd Qu.:2.604   3rd Qu.:12.25  
##  Max.   :0.03544   Max.   :0.8571   Max.   :2.604   Max.   :14.00  
## 
## mining info:
##                     data ntransactions support confidence
##  association[, -(31:32)]           395    0.03        0.8
rules.G3.F <- sort(rules.G3.F, by = c("confidence", "lift", "support"), decreasing = TRUE)
inspect(rules.G3.F[!is.redundant(rules.G3.F)])
##     lhs                          rhs        support    confidence lift    
## [1] {failures=3,schoolsup=no} => {G3_bin=F} 0.03037975 0.8571429  2.604396
## [2] {school=GP,failures=2}    => {G3_bin=F} 0.03037975 0.8571429  2.604396
## [3] {goout=5,absences=0}      => {G3_bin=F} 0.03037975 0.8571429  2.604396
## [4] {failures=2}              => {G3_bin=F} 0.03544304 0.8235294  2.502262
##     count
## [1] 12   
## [2] 12   
## [3] 12   
## [4] 14
  • Observation: (4 rules)
  • Students had failed 3 classes and no extra educational support from schools
  • Students had failed 2 classes
  • Stduents going out with friends frequently and have no absences

Cluster analysis

k-means needs to have numeric data, so use math.num and standardize the data

math.stand<- math.num %>% scale()

Find the optimal number of clusters

1. Gap Statistic Method

Minimizes the intra-cluster variation. The Gap Statistic compares the total within intra-cluster variation for different values of k with their expected values under null reference distribution of the data. The estimate of the optimal clusters will be the value that maximize the gap statistic.

gs.graph <- fviz_nbclust(math.stand, kmeans, method = "gap_stat") + 
  labs(subtitle = "Gap Statistic Method")
gs.graph

2. Elbow Method

The Elbow Method to define clusters such that the total intra-cluster variation, or total within-cluster sum of square (WSS), is minimized. The optimal number of clusters is the k value such that adding an additional cluster will not improve WSS by a significant amount.

elbow <- fviz_nbclust(math.stand, kmeans, method = "wss") + 
  labs(subtitle = "Elbow Method")

elbow

Elbow Method - alternative

set.seed(23)
# Function to compute totla within-cluster sum of square
wss <- function(x){
  kmeans(math.num, x, nstart = 10)$tot.withinss
}

# Compute and plot wss for k = 1 to k = 10 (or 15)
k.values <- 1:10

# Extract wss for 2 - 15 clusters
wss.values <- map_dbl(k.values, wss)

# Create a dataframe of k.values and wss.values to plotf
wss.df <- as.data.frame(k.values,wss.values)

# Elbow graph
elbow2 <- ggplot(data = wss.df, aes(x = k.values, y = wss.values)) +
  xlab("Number of Clusters, K") +
  ylab("Total Within-Clusters Sum of Squares") +
  geom_point() + 
  geom_line() +
  geom_vline(xintercept = 4, linetype = "dashed") +
  ggtitle("Optimal number of clusters", subtitle = "Elbow Method") +
  theme_classic()
elbow2

3. Silhouette Method

Silhouette measures how well an observation is clustered and approximates the average distance between clusters. Silhouette values fall between -1 and 1. A negative value suggests the data point is in the wrong cluster, a value near 0 means the data point is between two clusters, and values closer to 1 mean the data point is in the proper cluster.

sil <- fviz_nbclust(math.stand, kmeans, method = "silhouette") + 
  labs(subtitle = "Silhouette Method")

Plot each graph to find optimal number of clusters.

grid.arrange(gs.graph, elbow2, sil, nrow = 3)

  • Observations:
  • The Gap Statistic and Elbow Method suggest 4 clusters, and the Silhouette method suggests 2 clusters.
  • Try k = 2 - 5 to find the optimal clustering.

k = 2

set.seed(19)
math.k2 <- kmeans(math.num, 
                  centers = 2, 
                  nstart = 25)
math.k2
## K-means clustering with 2 clusters of sizes 64, 331
## 
## Cluster means:
##      school       sex      age   address   famsize    Pstatus     Medu
## 1 0.0625000 0.4218750 17.31250 0.1875000 0.6406250 0.18750000 2.906250
## 2 0.1268882 0.4833837 16.57704 0.2296073 0.7250755 0.08761329 2.719033
##       Fedu     Mjob     Fjob   reason  guardian traveltime studytime
## 1 2.562500 2.546875 3.031250 1.046875 0.5156250   1.375000  1.828125
## 2 2.513595 2.471299 2.942598 1.317221 0.3655589   1.462236  2.075529
##    failures schoolsup    famsup      paid activities   nursery    higher
## 1 0.5625000 0.1250000 0.6718750 0.4375000  0.4375000 0.8125000 0.9375000
## 2 0.2900302 0.1299094 0.6012085 0.4622356  0.5226586 0.7915408 0.9516616
##    internet  romantic   famrel freetime    goout     Dalc     Walc
## 1 0.8906250 0.4531250 3.796875 3.265625 3.312500 1.765625 2.796875
## 2 0.8217523 0.3111782 3.972810 3.229607 3.069486 1.425982 2.193353
##     health  absences      G1       G2       G3
## 1 3.468750 19.234375 10.4375 10.06250 10.06250
## 2 3.570997  3.093656 11.0000 10.83988 10.48338
## 
## Clustering vector:
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 1 2 2 2 1 2 2 2 2 2
##  [36] 2 2 2 2 2 1 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1
##  [71] 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2
## [106] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2
## [141] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2
## [176] 2 2 2 2 2 2 2 2 1 1 1 2 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 1 2 1 2 2 2 2
## [211] 2 1 2 1 1 2 1 1 2 2 2 2 2 2 2 1 2 2 1 2 1 2 1 2 1 2 2 1 2 2 1 2 2 2 2
## [246] 2 2 1 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2 1 1 1 2
## [281] 1 1 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 2 2 2 1 2 2 1 2 1 2 1 2 1 1
## [316] 1 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
## [351] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 1
## [386] 2 2 2 2 2 1 2 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 11288.34 26085.58
##  (between_SS / total_SS =  27.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
math.k.plot <- math %>% 
  mutate(cluster2 = as.factor(math.k2$cluster)) %>% 
  mutate(id = 1:nrow(math))

Plot the clusters

p2 <- fviz_cluster(math.k2, 
                   data = math.num)
p2

Rule with higher education appears many times - Students at age 15 want to get higher education - Students whose mother has higher education level tend to get higher education as well - Female wants to get higher education

  • Observation: (22 rules)
  • Mother is a teacher; reason choosing school is because of course; have free time after school
  • Mother is a teacher; don’t have family educational support; have extra-curricular activities
  • Mother has higher education level; reason choosing school is because of course; have free time after school
  • Male students; mother is a teacher; have free time after school
  • Father is a teacher, reason choosing school is because of course; have extra-curricular activities

Plot each student and their score on G3

ggplot(data = math.k.plot, aes(x = Medu, y = G3)) +
  geom_jitter(aes(color = cluster2)) +
  facet_grid(~Mjob) +
  xlab("Mother's Education Level") +
  ylab("Final Grade") +
  ggtitle("Mother's Education vs Final Grade", subtitle = "Mother's Job") +
  geom_hline(yintercept = 10) +
  annotate(geom="text", label = "Pass", x = 1, y = 9.5, vjust=-1.5) +
  annotate(geom="text", label = "Fail", x = 1, y = 8.5, vjust = 1) +
  theme_light()

Mjob: 0 = ‘Teacher’, 1 = ‘Health’ care related, 2 = Civil ‘services’, 3 = ‘at_home’, 4 = ‘other’

The majority of children whose mothers are teachers or in the health care field score well on their final grade. ### k = 2

set.seed(19)
math.k2 <- kmeans(math.num, 
                  centers = 2, 
                  nstart = 25)
math.k2
## K-means clustering with 2 clusters of sizes 64, 331
## 
## Cluster means:
##      school       sex      age   address   famsize    Pstatus     Medu
## 1 0.0625000 0.4218750 17.31250 0.1875000 0.6406250 0.18750000 2.906250
## 2 0.1268882 0.4833837 16.57704 0.2296073 0.7250755 0.08761329 2.719033
##       Fedu     Mjob     Fjob   reason  guardian traveltime studytime
## 1 2.562500 2.546875 3.031250 1.046875 0.5156250   1.375000  1.828125
## 2 2.513595 2.471299 2.942598 1.317221 0.3655589   1.462236  2.075529
##    failures schoolsup    famsup      paid activities   nursery    higher
## 1 0.5625000 0.1250000 0.6718750 0.4375000  0.4375000 0.8125000 0.9375000
## 2 0.2900302 0.1299094 0.6012085 0.4622356  0.5226586 0.7915408 0.9516616
##    internet  romantic   famrel freetime    goout     Dalc     Walc
## 1 0.8906250 0.4531250 3.796875 3.265625 3.312500 1.765625 2.796875
## 2 0.8217523 0.3111782 3.972810 3.229607 3.069486 1.425982 2.193353
##     health  absences      G1       G2       G3
## 1 3.468750 19.234375 10.4375 10.06250 10.06250
## 2 3.570997  3.093656 11.0000 10.83988 10.48338
## 
## Clustering vector:
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 1 2 2 2 1 2 2 2 2 2
##  [36] 2 2 2 2 2 1 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1
##  [71] 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2
## [106] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2
## [141] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2
## [176] 2 2 2 2 2 2 2 2 1 1 1 2 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 1 2 1 2 2 2 2
## [211] 2 1 2 1 1 2 1 1 2 2 2 2 2 2 2 1 2 2 1 2 1 2 1 2 1 2 2 1 2 2 1 2 2 2 2
## [246] 2 2 1 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2 1 1 1 2
## [281] 1 1 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 2 2 2 1 2 2 1 2 1 2 1 2 1 1
## [316] 1 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
## [351] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 1
## [386] 2 2 2 2 2 1 2 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 11288.34 26085.58
##  (between_SS / total_SS =  27.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
math.k.plot <- math %>% 
  mutate(cluster2 = as.factor(math.k2$cluster)) %>% 
  mutate(id = 1:nrow(math))

Plot the clusters

p2 <- fviz_cluster(math.k2, 
                   data = math.num)
p2

Rule with higher education appears many times - Students at age 15 want to get higher education - Students whose mother has higher education level tend to get higher education as well - Female wants to get higher education

  • Observation: (22 rules)
  • Mother is a teacher; reason choosing school is because of course; have free time after school
  • Mother is a teacher; don’t have family educational support; have extra-curricular activities
  • Mother has higher education level; reason choosing school is because of course; have free time after school
  • Male students; mother is a teacher; have free time after school
  • Father is a teacher, reason choosing school is because of course; have extra-curricular activities

Plot each student and their score on G3

ggplot(data = math.k.plot, aes(x = Medu, y = G3)) +
  geom_jitter(aes(color = cluster2)) +
  facet_grid(~Mjob) +
  xlab("Mother's Education Level") +
  ylab("Final Grade") +
  ggtitle("Mother's Education vs Final Grade", subtitle = "Mother's Job") +
  geom_hline(yintercept = 10) +
  annotate(geom="text", label = "Pass", x = 1, y = 9.5, vjust=-1.5) +
  annotate(geom="text", label = "Fail", x = 1, y = 8.5, vjust = 1) +
  theme_light()

Mjob: 0 = ‘Teacher’, 1 = ‘Health’ care related, 2 = Civil ‘services’, 3 = ‘at_home’, 4 = ‘other’

The majority of children whose mothers are teachers or in the health care field score well on their final grade.

k = 3

set.seed(19)
math.k3 <- kmeans(math.num,
                  centers = 3,
                  nstart = 25)
math.k3
## K-means clustering with 3 clusters of sizes 156, 57, 182
## 
## Cluster means:
##       school       sex      age   address   famsize    Pstatus     Medu
## 1 0.13461538 0.4038462 16.73718 0.2628205 0.7564103 0.07692308 2.461538
## 2 0.05263158 0.4210526 17.28070 0.1578947 0.6666667 0.15789474 2.964912
## 3 0.12087912 0.5494505 16.47802 0.2087912 0.6868132 0.10989011 2.928571
##       Fedu     Mjob     Fjob   reason  guardian traveltime studytime
## 1 2.288462 2.653846 3.115385 1.346154 0.4230769   1.551282  2.000000
## 2 2.614035 2.508772 3.017544 1.000000 0.4561404   1.368421  1.754386
## 3 2.692308 2.329670 2.802198 1.296703 0.3406593   1.384615  2.153846
##     failures  schoolsup    famsup      paid activities   nursery    higher
## 1 0.54487179 0.19230769 0.6474359 0.4166667  0.5000000 0.7948718 0.9102564
## 2 0.56140351 0.14035088 0.6842105 0.4210526  0.4561404 0.8245614 0.9298246
## 3 0.08241758 0.07142857 0.5604396 0.5054945  0.5329670 0.7857143 0.9890110
##    internet  romantic   famrel freetime    goout     Dalc     Walc
## 1 0.7884615 0.3397436 4.006410 3.211538 3.269231 1.455128 2.314103
## 2 0.9122807 0.4912281 3.754386 3.192982 3.298246 1.701754 2.754386
## 3 0.8461538 0.2802198 3.950549 3.269231 2.912088 1.434066 2.126374
##     health  absences        G1        G2        G3
## 1 3.557692  3.153846  8.108974  7.602564  6.673077
## 2 3.438596 20.122807 10.315789  9.894737  9.877193
## 3 3.587912  3.384615 13.494505 13.637363 13.791209
## 
## Clustering vector:
##   [1] 1 1 1 3 1 3 3 1 3 3 1 3 3 1 3 3 3 1 2 1 3 3 3 3 1 2 3 3 3 2 3 3 3 1 3
##  [36] 1 3 3 3 3 2 3 3 1 2 1 2 3 3 1 3 3 3 1 3 1 3 3 1 3 3 1 1 1 1 3 3 1 1 3
##  [71] 3 1 1 3 2 1 3 3 1 1 3 3 1 3 1 1 1 3 2 2 1 3 1 1 3 1 3 1 3 1 2 3 3 2 3
## [106] 3 1 3 3 3 3 1 3 3 1 3 3 3 2 3 3 3 3 2 1 3 1 1 1 3 1 1 2 2 1 1 1 1 3 3
## [141] 1 1 3 3 1 1 1 3 1 1 1 3 1 1 3 1 3 1 3 3 1 1 1 1 1 2 1 3 1 3 1 3 3 1 1
## [176] 1 3 1 1 1 1 3 3 2 2 2 3 3 1 1 3 1 1 1 3 3 3 1 2 1 3 1 1 2 1 2 1 3 1 1
## [211] 1 2 3 2 2 3 2 2 1 1 1 1 3 3 3 2 3 3 2 3 2 3 2 3 2 1 3 2 3 1 2 3 1 3 1
## [246] 3 3 2 1 3 1 1 1 1 3 1 3 2 3 1 2 1 3 1 1 3 1 3 1 1 2 3 3 3 1 3 2 2 2 1
## [281] 2 2 3 1 1 3 3 3 3 3 3 3 3 3 3 3 1 1 3 3 2 3 3 3 2 3 3 2 3 2 1 2 3 2 2
## [316] 2 1 1 3 3 2 2 3 3 3 3 3 1 1 3 1 3 1 1 1 2 3 1 3 1 3 1 3 1 1 3 3 1 3 3
## [351] 1 3 1 1 3 1 3 3 1 3 3 3 3 3 3 1 3 1 1 3 1 3 3 2 3 1 3 1 3 2 3 1 3 1 2
## [386] 1 1 1 1 1 1 3 1 3 1
## 
## Within cluster sum of squares by cluster:
## [1]  9008.250 10322.596  8303.434
##  (between_SS / total_SS =  46.3 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
p3 <- fviz_cluster(math.k3,
                   data = math.num)
p3

math.k.plot <- math.k.plot %>% 
  mutate(cluster3 = as.factor(math.k3$cluster))
ggplot(data = math.k.plot, aes(x = Medu, y = G3)) +
  geom_jitter(aes(color = cluster3)) +
  facet_grid(~Mjob) +
  xlab("Mother's Education Level") +
  ylab("Final Grade") +
  ggtitle("Mother's Education vs Final Grade", subtitle = "Mother's Job") +
  geom_hline(yintercept = 10) +
  annotate(geom="text", label = "Pass", x = 1.3, y = 9.5, vjust=-1.5) +
  annotate(geom="text", label = "Fail", x = 1.3, y = 8.5, vjust = 1) +
  theme_light()

ggplot(data = math.k.plot, aes(x = schoolsup, y = G3)) +
 geom_jitter(aes(color = cluster3)) +
 facet_grid(~failures) +
 xlab("Student") +
 ylab("Final Grade") +
 geom_hline(yintercept = 10) +
 annotate(geom="text", label = "Pass", x = 1.3, y = 9.5, vjust = -1.5) +
 annotate(geom="text", label = "Fail", x = 1.3, y = 8.5, vjust = 0) +
 ggtitle("School Support vs Final Score", subtitle = "Number of Failures") +
  theme_light()

ggplot(data = math.k.plot, aes(x = Walc, y = G3)) +
  geom_jitter(aes(color = cluster3)) +
  facet_grid(~Mjob) +
  xlab("Weekend Alcohol Consumption") +
  ylab("Final Grade") + 
  ggtitle("Weekend Alcohol Consumption vs Final Grade", subtitle = "Mother's Job") +
  geom_hline(yintercept = 10) +
  annotate(geom="text", label = "Pass", x = 1.3, y = 9.5, vjust = -1.5) +
  annotate(geom="text", label = "Fail", x = 1.3, y = 8.5, vjust = 0) +
  theme_light()

The students seem to be clustered by the performance on G3. Students who scored above 11 are in Cluster 3, 4-18 in Cluster 2 and 0-12 in cluster 3.

ggplot(data = math.k.plot, aes(x = famrel, y = G3)) +
  geom_jitter(aes(color = cluster3)) +
  facet_grid(~freetime) +
  xlab("Family Relationship") +
  ylab("Final Grade") + 
  ggtitle("Family Relationship vs Final Grade", subtitle = "Amount of Freetime") +
  geom_hline(yintercept = 10) +
  annotate(geom="text", label = "Pass", x = 1.3, y = 9.5, vjust = -1.5) +
  annotate(geom="text", label = "Fail", x = 1.2, y = 8.5, vjust = 0) +
  theme_light()

k = 4

set.seed(19)
math.k4 <- kmeans(math.num,
                  centers = 4,
                  nstart = 25)
math.k4
## K-means clustering with 4 clusters of sizes 5, 138, 173, 79
## 
## Cluster means:
##       school       sex      age   address   famsize    Pstatus     Medu
## 1 0.00000000 0.2000000 17.80000 0.4000000 0.8000000 0.20000000 3.000000
## 2 0.14492754 0.3768116 16.73188 0.2536232 0.7608696 0.07971014 2.391304
## 3 0.11560694 0.5664740 16.42775 0.2023121 0.6763006 0.09826590 2.913295
## 4 0.07594937 0.4556962 17.15190 0.2025316 0.6962025 0.15189873 3.000000
##       Fedu     Mjob     Fjob   reason  guardian traveltime studytime
## 1 3.000000 3.200000 2.800000 0.600000 0.8000000   1.400000  2.000000
## 2 2.253623 2.746377 3.101449 1.326087 0.3913043   1.536232  1.985507
## 3 2.722543 2.335260 2.791908 1.283237 0.3352601   1.381503  2.144509
## 4 2.518987 2.303797 3.075949 1.202532 0.4810127   1.443038  1.886076
##     failures  schoolsup    famsup      paid activities   nursery    higher
## 1 0.40000000 0.20000000 0.6000000 0.4000000  0.4000000 0.8000000 0.8000000
## 2 0.56521739 0.21014493 0.6304348 0.3768116  0.4782609 0.7826087 0.8985507
## 3 0.06936416 0.06936416 0.5838150 0.5144509  0.5375723 0.7803468 0.9884393
## 4 0.50632911 0.11392405 0.6455696 0.4810127  0.5063291 0.8481013 0.9620253
##    internet  romantic   famrel freetime    goout     Dalc     Walc
## 1 1.0000000 0.8000000 4.200000 2.200000 2.600000 1.400000 2.000000
## 2 0.7681159 0.3405797 4.000000 3.188406 3.246377 1.376812 2.217391
## 3 0.8497110 0.2658960 3.994220 3.271676 2.919075 1.410405 2.127168
## 4 0.8987342 0.4430380 3.721519 3.303797 3.316456 1.822785 2.797468
##     health  absences        G1        G2       G3
## 1 3.600000 52.600000 10.200000 10.000000  9.40000
## 2 3.572464  2.391304  7.985507  7.442029  6.34058
## 3 3.589595  2.965318 13.560694 13.728324 13.87861
## 4 3.443038 14.544304 10.253165  9.873418 10.01266
## 
## Clustering vector:
##   [1] 2 2 4 3 2 3 3 2 3 3 2 3 3 3 3 3 3 2 4 2 3 3 3 3 2 4 3 3 3 4 3 3 3 2 3
##  [36] 2 3 3 3 3 4 3 3 2 4 2 4 3 3 2 3 3 3 2 3 2 3 3 2 3 3 2 2 2 2 3 3 2 2 3
##  [71] 3 2 2 3 1 2 4 3 2 4 3 3 2 3 2 2 2 3 4 4 2 3 2 2 3 2 3 2 3 2 4 3 3 4 3
## [106] 4 2 3 3 3 3 2 3 3 2 3 3 3 4 3 3 3 3 4 2 3 2 2 2 3 2 2 4 4 2 2 2 2 3 3
## [141] 2 2 3 3 2 2 2 3 2 2 2 3 4 2 3 2 3 2 3 3 2 2 2 2 2 4 2 3 2 3 2 3 3 2 2
## [176] 2 3 2 4 2 4 3 3 1 4 4 3 3 2 2 4 2 4 4 3 3 3 4 4 2 3 2 2 4 3 4 2 4 2 2
## [211] 4 4 3 4 4 3 4 4 2 2 2 2 3 3 3 4 3 3 4 4 4 3 4 3 4 4 3 4 3 2 4 3 2 3 2
## [246] 3 3 4 2 3 2 2 2 2 3 2 3 4 3 2 4 2 3 2 2 3 2 4 4 2 4 3 3 3 2 3 1 4 4 4
## [281] 4 4 3 2 2 3 3 3 3 3 4 3 4 3 3 3 2 4 3 3 4 3 3 3 4 3 3 1 3 4 2 4 3 4 4
## [316] 1 2 4 3 3 4 4 3 3 3 3 3 4 2 3 2 3 2 2 2 4 4 2 3 2 3 2 3 2 2 3 3 2 3 4
## [351] 2 3 2 2 3 2 3 3 2 3 3 3 3 3 3 2 3 2 2 4 2 3 3 4 3 2 3 2 3 4 3 2 3 2 4
## [386] 2 2 2 2 2 4 3 2 3 2
## 
## Within cluster sum of squares by cluster:
## [1]  999.600 7538.819 7441.156 5379.595
##  (between_SS / total_SS =  58.5 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
p4 <- fviz_cluster(math.k4,
                   data = math.num)
p4

Add the cluster to the dataframe

math.k.plot <- math.k.plot %>% 
  mutate(cluster4 = as.factor(math.k4$cluster))

Plot the clusters on x and students on y

ggplot(data = math.k.plot, aes(x = Medu, y = G3)) +
  geom_jitter(aes(color = cluster4)) +
  facet_grid(~Mjob) +
  xlab("Mother's Education Level") +
  ylab("Final Grade") +
  ggtitle("Mother's Education vs Final Grade", subtitle = "Mother's Job") +
  geom_hline(yintercept = 10) +
  annotate(geom="text", label = "Pass", x = 1, y = 9.5, vjust=-1.5) +
  annotate(geom="text", label = "Fail", x = 1, y = 8.5, vjust = 1)

ggplot(data = math.k.plot, aes(x = schoolsup, y = G3)) +
  geom_jitter(aes(color = cluster4)) +
  facet_grid(~failures) +
  xlab("Student") +
  ylab("Final Grade") +
  geom_hline(yintercept = 10) +
  annotate(geom="text", label = "Pass", x = 1, y = 9.5, vjust = -1.5) +
  annotate(geom="text", label = "Fail", x = 1, y = 8.5, vjust = 0)

If a student fails a class 2 or more times, they are likely to fail it again.

k = 5

set.seed(19)
math.k5 <- kmeans(math.num,
                  centers = 5,
                  nstart = 25)
math.k5
## K-means clustering with 5 clusters of sizes 148, 5, 127, 72, 43
## 
## Cluster means:
##       school       sex      age   address   famsize    Pstatus     Medu
## 1 0.15540541 0.4324324 16.56757 0.2770270 0.7229730 0.09459459 2.547297
## 2 0.00000000 0.2000000 17.80000 0.4000000 0.8000000 0.20000000 3.000000
## 3 0.10236220 0.5826772 16.40945 0.1653543 0.6771654 0.09448819 3.015748
## 4 0.08333333 0.4305556 17.18056 0.1527778 0.6805556 0.16666667 2.958333
## 5 0.09302326 0.3953488 17.04651 0.3023256 0.8139535 0.04651163 2.279070
##       Fedu     Mjob     Fjob   reason  guardian traveltime studytime
## 1 2.405405 2.608108 3.067568 1.351351 0.3716216   1.554054  2.040541
## 2 3.000000 3.200000 2.800000 0.600000 0.8000000   1.400000  2.000000
## 3 2.748031 2.236220 2.740157 1.267717 0.3149606   1.338583  2.141732
## 4 2.500000 2.416667 3.069444 1.180556 0.4861111   1.361111  1.875000
## 5 2.232558 2.813953 3.046512 1.255814 0.4651163   1.558140  1.976744
##     failures  schoolsup    famsup      paid activities   nursery    higher
## 1 0.31756757 0.22972973 0.6216216 0.5135135  0.4797297 0.7837838 0.9459459
## 2 0.40000000 0.20000000 0.6000000 0.4000000  0.4000000 0.8000000 0.8000000
## 3 0.05511811 0.03937008 0.5748031 0.4803150  0.5669291 0.8188976 0.9842520
## 4 0.52777778 0.12500000 0.6527778 0.4722222  0.4583333 0.8333333 0.9583333
## 5 0.88372093 0.04651163 0.6279070 0.1860465  0.5348837 0.6976744 0.8604651
##    internet  romantic   famrel freetime    goout     Dalc     Walc
## 1 0.7770270 0.2702703 3.993243 3.195946 3.114865 1.500000 2.358108
## 2 1.0000000 0.8000000 4.200000 2.200000 2.600000 1.400000 2.000000
## 3 0.8661417 0.2598425 4.031496 3.307087 2.937008 1.354331 2.047244
## 4 0.8888889 0.4583333 3.722222 3.347222 3.333333 1.763889 2.763889
## 5 0.8139535 0.5116279 3.860465 3.093023 3.279070 1.325581 2.023256
##     health   absences        G1        G2         G3
## 1 3.594595  3.3175676  9.236486  9.594595  9.7162162
## 2 3.600000 52.6000000 10.200000 10.000000  9.4000000
## 3 3.496063  3.0472441 14.488189 14.582677 14.8346457
## 4 3.486111 15.1944444 10.236111  9.805556  9.9583333
## 5 3.697674  0.4651163  7.302326  4.744186  0.6511628
## 
## Clustering vector:
##   [1] 1 5 4 3 1 3 1 1 3 3 1 1 3 1 3 3 3 1 4 1 3 3 3 3 1 4 1 3 1 4 1 3 3 1 3
##  [36] 1 3 3 1 3 4 1 3 1 4 1 4 3 3 1 3 3 1 1 1 1 3 3 1 3 1 1 1 1 1 3 3 1 1 3
##  [71] 3 1 5 3 2 1 1 1 1 4 1 1 4 3 1 1 1 3 4 4 1 3 1 1 3 1 3 1 3 1 4 3 3 4 3
## [106] 4 1 3 1 3 3 1 1 3 1 3 3 3 4 3 3 3 3 4 1 3 1 1 5 3 5 5 4 4 5 5 5 5 3 3
## [141] 5 1 1 3 5 1 5 1 5 1 5 3 1 5 1 1 3 1 3 1 5 1 5 1 1 4 1 3 5 3 5 3 1 5 1
## [176] 1 3 5 4 1 4 3 3 2 4 4 1 3 1 1 4 1 4 1 3 3 3 1 4 1 3 1 1 4 1 4 1 4 1 1
## [211] 4 4 3 4 4 3 4 4 1 1 5 5 3 3 3 4 3 1 4 4 4 1 4 3 4 4 3 4 1 5 4 1 5 3 5
## [246] 3 3 4 5 3 1 1 1 1 1 1 3 4 3 5 4 1 3 1 5 3 1 1 4 5 4 3 1 3 1 1 2 4 4 1
## [281] 4 4 1 1 1 1 3 3 3 3 4 3 4 3 3 3 5 4 3 3 4 1 3 3 4 3 3 2 3 4 5 4 1 4 4
## [316] 2 5 1 1 1 4 4 1 3 3 1 3 1 1 3 1 3 5 5 5 4 4 5 3 1 1 5 3 5 1 3 3 1 3 4
## [351] 1 3 1 1 1 1 3 1 1 3 3 3 1 3 1 1 3 5 1 4 1 3 1 4 3 1 3 1 3 4 3 1 1 5 4
## [386] 1 1 5 1 5 4 3 1 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 5274.662  999.600 4976.913 4895.889 1716.233
##  (between_SS / total_SS =  65.3 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
p5 <- fviz_cluster(math.k5,
                   data = math.num)
p5

Add the cluster to the dataframe

math.k.plot <- math.k.plot %>% 
  mutate(cluster5 = as.factor(math.k5$cluster))

Plot the clusters on x and students on y

ggplot(data = math.k.plot, aes(x = Medu, y = G3)) +
  geom_jitter(aes(color = cluster5)) +
  facet_grid(~Mjob) +
  xlab("Mother's Education Level") +
  ylab("Final Grade") +
  ggtitle("Mother's Education vs Final Grade", subtitle = "Mother's Job") +
  geom_hline(yintercept = 10) +
  annotate(geom="text", label = "Pass", x = 1, y = 9.5, vjust = -1.5) +
  annotate(geom="text", label = "Fail", x = 1, y = 8.5, vjust = 1) +
  theme_light()

Cluster 1 has students that scored 11 or higher. This is the majority of students who earned a passing score on G3. Cluster 3 has those students that scored 0 on G3, suggesting that they did not take the final exam. Cluster 2 has teh largest variance of scores, from 5 to 18. Cluster 4 has those middling students (scores between 6 and 13) who would likely pass the exam with extra help. Cluster 5 has less than 10 students.

The majority of students in Cluster 1 passed the G3 exam. What do they have in common?

clust.1.math.k5 <- math.k.plot %>% 
  filter(cluster5 == "1")

ggplot(data = math[clust.1.math.k5$id, ], aes(x = Medu, y = G3)) +
  geom_jitter(aes(color = Mjob)) +
  facet_grid(~studytime) +
  ggtitle("Cluster 1: Mother's Job vs Mother's Education", subtitle = "Student's study time level") +
  labs(x = "Education Level",
       y = "G3 Grade") +
  geom_hline(yintercept = 9) +
   annotate(geom="text", label = "Pass", x = 0.75, y = 9.5, vjust = -0.5) +
  annotate(geom="text", label = "Fail", x = 0.75, y = 8.5, vjust = 1) +
  theme_light()

It looks like those students that have final scores between 6 and 11 should put in more study time. There are more see through dots than solid ones.

clust.2.math.k5 <- math.k.plot %>% 
  filter(cluster5 == "2")

ggplot(data = math[clust.2.math.k5$id, ], aes(x = Medu, y = G3)) +
  geom_jitter(aes(color = Mjob)) +
  facet_grid(~studytime) +
  ggtitle("Cluster 2: Mother's Job vs Mother's Education", subtitle = "Student's study time level") +
  labs(x = "Education Level",
       y = "G3 Grade") +
  geom_hline(yintercept = 9) +
  annotate(geom="text", label = "Pass", x = 2.2, y = 9.5, vjust = -0.5) +
  annotate(geom="text", label = "Fail", x = 2.2, y = 8.5, vjust = 1) +
  theme_light()

clust.3.math.k5 <- math.k.plot %>% 
  filter(cluster5 == "3")

ggplot(data = math[clust.3.math.k5$id, ], aes(x = Medu, y = G3)) +
  geom_jitter(aes(color = Mjob)) +
  facet_grid(~studytime) +
  ggtitle("Cluster 3: Mother's Job vs Mother's Education", subtitle = "Student's study time level") +
  labs(x = "Education Level",
       y = "G3 Grade") +
  geom_hline(yintercept = 9) +
  annotate(geom="text", label = "Pass", x = 0.75, y = 9.5, vjust = -0.5) +
  annotate(geom="text", label = "Fail", x = 0.75, y = 8.5, vjust = 1) +
  theme_light()

clust.4.math.k5 <- math.k.plot %>% 
  filter(cluster5 == "4")

ggplot(data = math[clust.4.math.k5$id, ], aes(x = Medu, y = G3)) +
  geom_jitter(aes(color = Mjob)) +
  facet_grid(~studytime) +
  ggtitle("Cluster 4: Mother's Job vs Mother's Education", subtitle = "Student's study time level") +
  labs(x = "Education Level",
       y = "G3 Grade") +
  geom_hline(yintercept = 9) +
  annotate(geom="text", label = "Pass", x = 1.3, y = 9.5, vjust = -0.5) +
  annotate(geom="text", label = "Fail", x = 1.3, y = 8.5, vjust = 1) +
  theme_light()

clust.5.math.k5 <- math.k.plot %>% 
  filter(cluster5 == "5")

ggplot(data = math[clust.5.math.k5$id, ], aes(x = Medu, y = G3)) +
  geom_jitter(aes(color = Mjob)) +
  facet_grid(~studytime) +
  ggtitle("Cluster 5: Mother's Job vs Mother's Education", subtitle = "Student's study time level") +
  labs(x = "Education Level",
       y = "G3 Grade") +
  geom_hline(yintercept = 9) +
  annotate(geom="text", label = "Pass", x = 1.1, y = 9.5, vjust = -0.2) +
  annotate(geom="text", label = "Fail", x = 1.1, y = 8.5, vjust = 1) +
  theme_light()

Naive Bayes

Use alphabet coding

math.nb <- math.num %>% 
  mutate(G1_bin = ifelse(
    G1 < 10, "F", ifelse(
      G1 < 12, "D", ifelse(
        G1 < 14, "C", ifelse(
          G1 < 16, "B", "A"  )  )  )  )  ) %>%
  mutate(G2_bin = ifelse(
    G2 < 10, "F", ifelse(
      G2 < 12, "D", ifelse(
        G2 < 14, "C", ifelse(
          G2 < 16, "B", "A"  )  )  )  )  ) %>%
  mutate(G3_bin = ifelse(
    G3 < 10, "F", ifelse(
      G3 < 12, "D", ifelse(
        G3 < 14, "C", ifelse(
          G3 < 16, "B", "A"  )  )  )  )  )

main.var <- c("G3_bin", "G2_bin", "G1_bin", "absences",  "goout", "Medu", "Walc" ,"reason","freetime", "famrel", "health", "age", "Mjob", "Fjob", "Fedu", "studytime")

math.nb <- math.nb %>%
 select(main.var)

preProc <- preProcess(math.nb, method = c("range","nzv"))
math.nb.scale <- predict(preProc, newdata=math.nb)
math.nb.scale <- math.nb.scale %>% mutate_if(is.character, as.factor)

Split the data into train (70%) and test(30%)

set.seed(19)
inTraining <- createDataPartition(math.nb.scale$G3_bin, p = 0.7, list = FALSE)
math.train <- math.nb.scale[inTraining,]
math.test <- math.nb.scale[-inTraining,]
set.seed(19)
nb.model1 <- naiveBayes(G3_bin ~., data = math.train)

Predict with default model

set.seed(19)
nb.model.pred <- predict(nb.model1, newdata = math.test, type = "class")

caret::confusionMatrix(data = nb.model.pred,
                reference = math.test$G3_bin, mode = "prec_recall")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  A  B  C  D  F
##          A 11  1  0  0  0
##          B  1 13  2  0  0
##          C  0  4 11  4  0
##          D  0  0  5 20  3
##          F  0  0  0  6 36
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7778          
##                  95% CI : (0.6916, 0.8494)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7085          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: F
## Precision             0.91667   0.8125  0.57895   0.7143   0.8571
## Recall                0.91667   0.7222  0.61111   0.6667   0.9231
## F1                    0.91667   0.7647  0.59459   0.6897   0.8889
## Prevalence            0.10256   0.1538  0.15385   0.2564   0.3333
## Detection Rate        0.09402   0.1111  0.09402   0.1709   0.3077
## Detection Prevalence  0.10256   0.1368  0.16239   0.2393   0.3590
## Balanced Accuracy     0.95357   0.8460  0.76515   0.7874   0.9231

Use Cross Validation to find best parameters

set.seed(19)
nb.train.control <- trainControl(method = "cv", number = 3)

nb <- train(G3_bin~.,
                math.train,
                 method = "naive_bayes",
                 trControl = nb.train.control)
plot(nb)

nb.grid <- expand.grid(laplace = seq(1, 3, 1), 
            usekernel = TRUE,
           adjust = seq(1, 3, 1))

nb.tune <- train(G3_bin~.,
                math.train,
                 method = "naive_bayes",
                 tuneGrid = nb.grid,
                 trControl = nb.train.control)

plot(nb.tune)

nb.tune.pred <- predict(nb.tune, newdata=math.test)

confusionMatrix(nb.tune.pred, math.test$G3_bin, mode = "prec_recall")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  A  B  C  D  F
##          A  0  0  0  0  0
##          B  1  5  0  0  0
##          C  1  4 12  4  1
##          D 10  9  6 22  9
##          F  0  0  0  4 29
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5812          
##                  95% CI : (0.4864, 0.6718)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : 3.418e-08       
##                                           
##                   Kappa : 0.4389          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: F
## Precision                  NA  0.83333   0.5455   0.3929   0.8788
## Recall                 0.0000  0.27778   0.6667   0.7333   0.7436
## F1                         NA  0.41667   0.6000   0.5116   0.8056
## Prevalence             0.1026  0.15385   0.1538   0.2564   0.3333
## Detection Rate         0.0000  0.04274   0.1026   0.1880   0.2479
## Detection Prevalence   0.0000  0.05128   0.1880   0.4786   0.2821
## Balanced Accuracy      0.5000  0.63384   0.7828   0.6713   0.8462

SVM

SVM can take numeric/nominal variables
#### set up Pass/Fail classes

math.svm <- math.num %>% 
  mutate(G1_bin = ifelse(
    G1 < 10, "Fail", "Pass")) %>%
  mutate(G2_bin = ifelse(
    G2 < 10, "Fail", "Pass"))  %>%
  mutate(G3_bin = ifelse(
    G3 < 10, "Fail", "Pass")) 

math.svm <- math.svm %>%
  select(-c(G1,G2,G3))
preProc <- preProcess(math.svm, method = c("scale","nzv"))
math.svm.scale <- predict(preProc, newdata=math.svm)
math.svm.scale <- math.svm.scale %>% mutate_if(is.character, as.factor)
math.svm <- math.assoc %>%
  select(-c(G1,G2,G3))

Split the data into train (70%) and test(30%)

set.seed(19)
inTraining <- createDataPartition(math.svm.scale$G3_bin, p = 0.7, list = FALSE)
math.train <- math.svm.scale[inTraining,]
math.test <- math.svm.scale[-inTraining,]

Linear kernel

set.seed(19)
svm.model1 <- svm(G3_bin~., data = math.train, type = "C-classification", kernel = "linear")
svm.pred <- predict(svm.model1, newdata=math.test)
confusionMatrix(svm.pred, math.test$G3_bin, mode = "prec_recall")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Fail Pass
##       Fail   38    4
##       Pass    1   75
##                                           
##                Accuracy : 0.9576          
##                  95% CI : (0.9039, 0.9861)
##     No Information Rate : 0.6695          
##     P-Value [Acc > NIR] : 1.54e-14        
##                                           
##                   Kappa : 0.9061          
##                                           
##  Mcnemar's Test P-Value : 0.3711          
##                                           
##               Precision : 0.9048          
##                  Recall : 0.9744          
##                      F1 : 0.9383          
##              Prevalence : 0.3305          
##          Detection Rate : 0.3220          
##    Detection Prevalence : 0.3559          
##       Balanced Accuracy : 0.9619          
##                                           
##        'Positive' Class : Fail            
## 
  • Observations:
  • Accuracy : 0.9576

Polynomial kernel

set.seed(19)
svm.model1 <- svm(G3_bin~., data = math.train, type = "C-classification", kernel = "polynomial")
svm.pred <- predict(svm.model1, newdata=math.test)
confusionMatrix(svm.pred, as.factor(math.test$G3_bin), mode = "prec_recall")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Fail Pass
##       Fail    6    2
##       Pass   33   77
##                                           
##                Accuracy : 0.7034          
##                  95% CI : (0.6123, 0.7839)
##     No Information Rate : 0.6695          
##     P-Value [Acc > NIR] : 0.2486          
##                                           
##                   Kappa : 0.1609          
##                                           
##  Mcnemar's Test P-Value : 3.959e-07       
##                                           
##               Precision : 0.75000         
##                  Recall : 0.15385         
##                      F1 : 0.25532         
##              Prevalence : 0.33051         
##          Detection Rate : 0.05085         
##    Detection Prevalence : 0.06780         
##       Balanced Accuracy : 0.56426         
##                                           
##        'Positive' Class : Fail            
## 
  • Observations:
  • Accuracy : 0.7034

Radial kernel

set.seed(19)
svm.model1 <- svm(G3_bin~., data = math.train, type = "C-classification", kernel = "radial")
svm.pred <- predict(svm.model1, newdata=math.test)
confusionMatrix(svm.pred, as.factor(math.test$G3_bin), mode = "prec_recall")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Fail Pass
##       Fail   28    3
##       Pass   11   76
##                                          
##                Accuracy : 0.8814         
##                  95% CI : (0.809, 0.9336)
##     No Information Rate : 0.6695         
##     P-Value [Acc > NIR] : 9.905e-08      
##                                          
##                   Kappa : 0.7172         
##                                          
##  Mcnemar's Test P-Value : 0.06137        
##                                          
##               Precision : 0.9032         
##                  Recall : 0.7179         
##                      F1 : 0.8000         
##              Prevalence : 0.3305         
##          Detection Rate : 0.2373         
##    Detection Prevalence : 0.2627         
##       Balanced Accuracy : 0.8400         
##                                          
##        'Positive' Class : Fail           
## 
  • Observations:
  • Accuracy : 0.8814

Use k-fold cross-validation technique to train model

Linear kernel

# Set up 3-fold cross validation procedure
train_control <- trainControl( method = "cv",   number = 3)

# Tune the model - find the optimal C
set.seed(19)
svmGrid <- expand.grid(C = seq(1,10,by=1))
svm.caret <- train(G3_bin~., data = math.train, method="svmLinear", tuneGrid = svmGrid,
                  trControl = train_control)

# Visualize the tuning result
plot(svm.caret)

# Validation
set.seed(19)
svm.pred<- predict(svm.caret, newdata = math.test, type="raw")

# Results
confusionMatrix(svm.pred,math.test$G3_bin, mode = "prec_recall")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Fail Pass
##       Fail   37    5
##       Pass    2   74
##                                           
##                Accuracy : 0.9407          
##                  95% CI : (0.8816, 0.9758)
##     No Information Rate : 0.6695          
##     P-Value [Acc > NIR] : 1.179e-12       
##                                           
##                   Kappa : 0.8685          
##                                           
##  Mcnemar's Test P-Value : 0.4497          
##                                           
##               Precision : 0.8810          
##                  Recall : 0.9487          
##                      F1 : 0.9136          
##              Prevalence : 0.3305          
##          Detection Rate : 0.3136          
##    Detection Prevalence : 0.3559          
##       Balanced Accuracy : 0.9427          
##                                           
##        'Positive' Class : Fail            
## 
  • Observations:
  • Accuracy : 0.9407

Polynomial kernel

# Set up 3-fold cross validation procedure
train_control <- trainControl( method = "cv",   number = 3)

# Tune the model
set.seed(19)
svmGrid <- expand.grid(degree=seq(2,5,by=1),scale= c(0.01,0.001) , C = seq(1,10,by=1))
svm.caret <- train(G3_bin~., data = math.train, method="svmPoly", tuneGrid = svmGrid,
                  trControl = train_control)

# Visualize the tuning result
plot(svm.caret)

# Validation
set.seed(19)
svm.pred<- predict(svm.caret, newdata = math.test, type="raw")

# Results
confusionMatrix(svm.pred,math.test$G3_bin, mode = "prec_recall")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Fail Pass
##       Fail   38    5
##       Pass    1   74
##                                           
##                Accuracy : 0.9492          
##                  95% CI : (0.8926, 0.9811)
##     No Information Rate : 0.6695          
##     P-Value [Acc > NIR] : 1.461e-13       
##                                           
##                   Kappa : 0.888           
##                                           
##  Mcnemar's Test P-Value : 0.2207          
##                                           
##               Precision : 0.8837          
##                  Recall : 0.9744          
##                      F1 : 0.9268          
##              Prevalence : 0.3305          
##          Detection Rate : 0.3220          
##    Detection Prevalence : 0.3644          
##       Balanced Accuracy : 0.9555          
##                                           
##        'Positive' Class : Fail            
## 
  • Observations:
  • Accuracy : 0.9492

Radial kernel

# Set up 3-fold cross validation procedure
train_control <- trainControl( method = "cv",   number = 3)

# Tune the model
set.seed(19)
svmGrid <- expand.grid(sigma=c(0.005, 0.01,0.001) , C = seq(1,10,by=1))
svm.caret <- train(G3_bin~., data = math.train, method="svmRadial", tuneGrid = svmGrid,
                  trControl = train_control)

# Visualize the tuning result
plot(svm.caret)

# Validation
set.seed(19)
svm.pred<- predict(svm.caret, newdata = math.test, type="raw")

# Results
confusionMatrix(svm.pred,math.test$G3_bin, mode = "prec_recall")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Fail Pass
##       Fail   38    5
##       Pass    1   74
##                                           
##                Accuracy : 0.9492          
##                  95% CI : (0.8926, 0.9811)
##     No Information Rate : 0.6695          
##     P-Value [Acc > NIR] : 1.461e-13       
##                                           
##                   Kappa : 0.888           
##                                           
##  Mcnemar's Test P-Value : 0.2207          
##                                           
##               Precision : 0.8837          
##                  Recall : 0.9744          
##                      F1 : 0.9268          
##              Prevalence : 0.3305          
##          Detection Rate : 0.3220          
##    Detection Prevalence : 0.3644          
##       Balanced Accuracy : 0.9555          
##                                           
##        'Positive' Class : Fail            
## 
  • Observations:
  • Accuracy : 0.9492

Random Forest

Random Forest can take can take numeric/nominal variables

Data with five buckets

math.rf <- math.num %>% 
  mutate(G1_bin = ifelse(
    G1 < 10, "F", ifelse(
      G1 < 12, "D", ifelse(
        G1 < 14, "C", ifelse(
          G1 < 16, "B", "A"  )  )  )  )  ) %>%
  mutate(G2_bin = ifelse(
    G2 < 10, "F", ifelse(
      G2 < 12, "D", ifelse(
        G2 < 14, "C", ifelse(
          G2 < 16, "B", "A"  )  )  )  )  ) %>%
  mutate(G3_bin = ifelse(
    G3 < 10, "F", ifelse(
      G3 < 12, "D", ifelse(
        G3 < 14, "C", ifelse(
          G3 < 16, "B", "A"  )  )  )  )  )

math.rf <- math.rf %>%
  select(-G1,-G2,-G3)
preProc <- preProcess(math.rf, method = c("scale","nzv"))
math.rf.scale <- predict(preProc, newdata=math.rf)
math.rf.scale <- math.rf.scale %>% mutate_if(is.character, as.factor)

Split the data into train (70%) and test(30%)

set.seed(19)
inTraining <- createDataPartition(math.rf.scale$G3_bin, p = 0.7, list = FALSE)
math.train <- math.rf.scale[inTraining,]
math.test <- math.rf.scale[-inTraining,]
set.seed(19)
rf.model <- randomForest(G3_bin~., data = math.train, ntree = 500)

rf.model.pred <- predict(rf.model, newdata=math.test)

confusionMatrix(rf.model.pred, math.test$G3_bin)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  A  B  C  D  F
##          A 11  1  0  0  0
##          B  1 12  2  0  0
##          C  0  3  9  4  0
##          D  0  1  7 20  3
##          F  0  1  0  6 36
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7521          
##                  95% CI : (0.6638, 0.8273)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6731          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: F
## Sensitivity           0.91667   0.6667  0.50000   0.6667   0.9231
## Specificity           0.99048   0.9697  0.92929   0.8736   0.9103
## Pos Pred Value        0.91667   0.8000  0.56250   0.6452   0.8372
## Neg Pred Value        0.99048   0.9412  0.91089   0.8837   0.9595
## Prevalence            0.10256   0.1538  0.15385   0.2564   0.3333
## Detection Rate        0.09402   0.1026  0.07692   0.1709   0.3077
## Detection Prevalence  0.10256   0.1282  0.13675   0.2650   0.3675
## Balanced Accuracy     0.95357   0.8182  0.71465   0.7701   0.9167
# Feature analysis
varImpPlot(rf.model, n.var = 10)

  • Observations:
  • Accuracy : 0.7521

Use k-fold cross-validation technique to train model

# Set up 3-fold cross validation procedure
train_control <- trainControl( method = "cv",   number = 3)

# Tune the model - find the optimal k
set.seed(19)
rfGrid <- expand.grid(mtry = seq(1,10,by=1))
rf.caret <- train(G3_bin~., data = math.train, method="rf", tuneGrid = rfGrid,
                  trControl = train_control)

# Visualize the tuning result
plot(rf.caret)

# Validation
set.seed(19)
rf.pred<- predict(rf.caret, newdata = math.test, type="raw")

# Results
confusionMatrix(rf.pred, math.test$G3_bin)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  A  B  C  D  F
##          A 12  1  0  0  0
##          B  0 13  1  0  0
##          C  0  4 12  4  0
##          D  0  0  5 20  2
##          F  0  0  0  6 37
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8034          
##                  95% CI : (0.7198, 0.8711)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7421          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: F
## Sensitivity            1.0000   0.7222   0.6667   0.6667   0.9487
## Specificity            0.9905   0.9899   0.9192   0.9195   0.9231
## Pos Pred Value         0.9231   0.9286   0.6000   0.7407   0.8605
## Neg Pred Value         1.0000   0.9515   0.9381   0.8889   0.9730
## Prevalence             0.1026   0.1538   0.1538   0.2564   0.3333
## Detection Rate         0.1026   0.1111   0.1026   0.1709   0.3162
## Detection Prevalence   0.1111   0.1197   0.1709   0.2308   0.3675
## Balanced Accuracy      0.9952   0.8561   0.7929   0.7931   0.9359
  • Observations:
  • Accuracy : 0.8034

Random Forest

Random Forest can take can take numeric/nominal variables

Data with fail and pass bucket

math.rf <- math.num %>% 
  mutate(G1_bin = ifelse(
    G1 < 10, "Fail", "Pass")) %>%
  mutate(G2_bin = ifelse(
    G2 < 10, "Fail", "Pass"))  %>%
  mutate(G3_bin = ifelse(
    G3 < 10, "Fail", "Pass")) 
math.rf <- math.rf %>%
  select(-c(G1,G2,G3))
preProc <- preProcess(math.rf, method = c("scale","nzv"))
math.rf.scale <- predict(preProc, newdata=math.rf)
math.rf.scale <- math.rf.scale %>% mutate_if(is.character, as.factor)

Split the data into train (70%) and test(30%)

set.seed(19)
inTraining <- createDataPartition(math.rf.scale$G3_bin, p = 0.7, list = FALSE)
math.train <- math.rf.scale[inTraining,]
math.test <- math.rf.scale[-inTraining,]
rf.model <- randomForest(G3_bin~., data = math.train, ntree = 500)

rf.model.pred <- predict(rf.model, newdata=math.test)

confusionMatrix(rf.model.pred, math.test$G3_bin, mode = "prec_recall")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Fail Pass
##       Fail   37    4
##       Pass    2   75
##                                           
##                Accuracy : 0.9492          
##                  95% CI : (0.8926, 0.9811)
##     No Information Rate : 0.6695          
##     P-Value [Acc > NIR] : 1.461e-13       
##                                           
##                   Kappa : 0.8866          
##                                           
##  Mcnemar's Test P-Value : 0.6831          
##                                           
##               Precision : 0.9024          
##                  Recall : 0.9487          
##                      F1 : 0.9250          
##              Prevalence : 0.3305          
##          Detection Rate : 0.3136          
##    Detection Prevalence : 0.3475          
##       Balanced Accuracy : 0.9490          
##                                           
##        'Positive' Class : Fail            
## 
# Feature analysis
varImpPlot(rf.model, n.var = 15)

  • Observations:
  • Accuracy : 0.9492

Use k-fold cross-validation technique to train model

# Set up 3-fold cross validation procedure
train_control <- trainControl( method = "cv",   number = 3)

# Tune the model - find the optimal k
set.seed(19)
rfGrid <- expand.grid(mtry = seq(1,10,by=1))
rf.caret <- train(G3_bin~., data = math.train, method="rf", tuneGrid = rfGrid,
                  trControl = train_control)

# Visualize the tuning result
plot(rf.caret)

# Validation
set.seed(19)
rf.pred<- predict(rf.caret, newdata = math.test, type="raw")

# Results
confusionMatrix(rf.pred, math.test$G3_bin, mode = "prec_recall")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Fail Pass
##       Fail   38    5
##       Pass    1   74
##                                           
##                Accuracy : 0.9492          
##                  95% CI : (0.8926, 0.9811)
##     No Information Rate : 0.6695          
##     P-Value [Acc > NIR] : 1.461e-13       
##                                           
##                   Kappa : 0.888           
##                                           
##  Mcnemar's Test P-Value : 0.2207          
##                                           
##               Precision : 0.8837          
##                  Recall : 0.9744          
##                      F1 : 0.9268          
##              Prevalence : 0.3305          
##          Detection Rate : 0.3220          
##    Detection Prevalence : 0.3644          
##       Balanced Accuracy : 0.9555          
##                                           
##        'Positive' Class : Fail            
## 
  • Observations:
  • Accuracy : 0.9492

Conclusion

  1. What are the fundamental factors that will affect students’ performance on their final grade?

The largest indicators for a student’s performance at the end of the course are his or her performance in the previous two terms, G1 and G2. Other factors that have large impacts on these scores are the number of absences, the number of previously failed courses, the mother’s and/or father’s education level, and the job held by the mother.

  1. Which factors influence poor performance on the final grade the most?

If a student fails a course and does not have support from the school via tutoring or afterschool help, the student is likely to fail again. As the number of failures increases, the number of students below the Pass/Fail line on the plot increases.

Note that of the students that failed a course, the majority fall into the no School Support category. A factor that can have a large impact on each grade period is School Support.

  1. What would be the best way to improve student scores on their final grade? The best way to improve a student’s grade is to help him or her throughout the course. If students are successful in G1 and G2, it is highly probable they will be successful for the end of the course, G3. Students who had no access to school support systems have higher failure rates. This in turn will help the students to pass a course the first time they take it.

Next Step

Gather data from more students so that the models are better able to predict student performance using variables other than previous scores. ***